XMOS barrel processors are a neat little artifact from an alien planet. Interesting, but mostly a curiosity. I'd rather use a SoC-FPGA and a standard language over some tiny company's proprietary custom stuff anyway.
The wacky thing about XMOS is the instruction set and programming language, not that it's a barrel processor, which is a perfectly fine way to design a multi-core CPU if you care about total throughput not peak single-thread speed.
At last week's RISC-V Summit in Munich, Ben Abdelhamid, a postdoc at Heidelberg University, showed his "BRISKI" RISC-V RV32I barrel processor, which implements up to 16 harts per core. It uses fewer than 800 LUTs and runs at 650+ MHz in a VU9P FPGA, with CPI = 1 always (no cache or branch predictor needed), achieving a pretty amazing 0.8 MIPS/LUT. It fits 1024 cores (16384 hardware threads) in that FPGA.
Barrel processors have a lot of advantages for "embarrassingly parallel" problems. Basically you can avoid the significant silicon real estate used by caches, branch prediction, and superscalar machinery, and spend it on many simple cores instead.
Sun used that to great advantage in their UltraSPARC T-series processors for server-side workloads. When I used them for a soft real-time telecoms server system, other engineers were stunned by the (easily achieved) performance. Shame Big Red borged them.
It is easy to produce fast parallel hardware; many companies have done it. Where they have fallen down is normally the software - especially when they expect people to be able to write correct parallel C/C++!
The key point about XMOS is not the processor (and who cares about the instruction set[1]) but the way the cores, switch fabric, peripherals, language and toolchain have been integrated into a remarkably pleasant ecosystem.
- CSP starts with the presumption that processing is parallel, and has a decent theoretical basis. Unsurprising when it was invented by Tony Hoare (Quicksort, monitors, the null reference, Turing Award), and its concepts have been implemented again and again (occam on the Transputer, Go, etc)
- xC is C with the bits that make parallelism "difficult" removed (e.g. pointer aliasing), plus RTOS and CSP parallelism constructs added
- the Eclipse-based IDE takes full advantage of everything being integrated: it compiles the xC (and C/C++, ugh) code, then examines the optimised binaries to determine the min/max execution times of every path between two points in the code
We urgently need hardware+software ecosystems that are based on parallelism from the ground (silicon) up to the application. 1970s tech with a few warts bolted on 40 years too late[2] simply won't cut it in the future. So far xC+xCORE+IDE is the only practical demonstration of a better system. Rust has promise, but is too far from the hardware.
We need more highly parallel ecosystems.
[1] they have even shipped chips where one of the cores was an ARM processor, and that foreign core was fully integrated into the hardware+xC+IDE ecosystem.
[2] start with the concept of a memory model, and never forget that C defined parallelism as a library matter — while explicitly declining to provide the language-level guarantees that would let a library implement correct parallel functions in C.