Have a look at the XMOS processors. They have hard realtime scalable multicore hardware and software, ...
VERY INTERESTING! The design is such that each task has its own core yet can communicate with other cores.
They are, aren't they! Most MCUs are very similar to each other, and most languages are inherently serial, not parallel. It is rare to find some that are significantly different and, most importantly, unify the hardware and the software.
I have only "kicked the tyres" with a simple design, but I found their documentation remarkably simple, clear, and without strange "gotchas". I haven't found any bugs either.
It is worth realising that the concepts are old and have stood the test of time: CSP is from the 70s, hardware for CSP is from the 80s (Transputer), software for CSP from the 80s (Occam), and XMOS xCore is a decade or so old. Many CSP/Occam concepts are re-materialising in modern languages such as Go and Rust (but I've used neither).
Prof. David May has been involved in all of that, and has avoided past problems.
I would like to know more about how their XCONNECT switch handles off-chip messaging.
I am not an expert and have not investigated this, however I suspect I can offer a few pointers:
- within a tile all cores/tasks share the same memory on a timesliced basis - think of SMT
- within a tile, inter-task comms can be implemented either via a comms channel or via shared memory
- across tiles, comms have to be implemented by copying memory using a comms channel; clearly this adds latency
- xC ensures all that is transparent. There are restrictions to ensure correctness, e.g. no cross-task aliasing of memory
- if comms can occur between tiles on the same chip, extension to comms between tiles on different chips is trivial. ISTR it requires 5 wires and a serialisation protocol, but you would be wise to verify that
- both i/o and inter-task comms use the same language primitives and xCONNECT. That works very pleasantly - just "think of" the i/o port as a different task.
For further information, dig around on the XMOS website and forum. I've found these documents particularly useful, but others may be more directly relevant to your questions.
XMOS-DIY-USB.pdf
XMOS-Introduction-to-XS1-ports_3.pdf
XMOS-Programming-Guide-_documentation_F-2.pdf
XMOS-XCC-Command-Line-Manual_X6904A.pdf
XMOS-XC-Reference-Manual_8.7-[Y-M].pdf
XMOS-XS1-Architecture_1.0.pdf
XMOS-XS1-Assembly-Language-Manual_8.7-[Y-M].pdf
XMOS-xTIMEcomposer-User-Guide-14_14.x.pdf
Also, I find the following statements hard to believe, based on my knowledge of chip design:
- Each tile contains local SRAM memory, which is shared between all cores on that tile for code and data
- Each scheduled core has an allocated slot to access the memory in a single cycle
- The xCORE memory will always respond within the allocated cycle
Execution out of RAM is $$$ for large applications.
This is aimed at hard realtime embedded programming, not general purpose systems. See digikey for available processors.
I would question the necessity of having very large memory shared between many cores/tasks. Everything I've heard leads me to believe that "high performance computing" is heavily based on message-passing between separate non-shared memory computers. I believe that general purpose systems will have to go that route, but it will take a generation of kicking and screaming by people wedded to existing languages and implementations.
Although the HPC-vs-xCORE/xC details are very different, many of the high-level paradigms are similar: if you can "think" in one, you can "think" in the other. (The same can be said of Java-vs-C#, for example).
One BIG benefit of "reflective memory" is that you double the amount of RAM and Flash available to the total application. Yes, it will require some forethought on how to divide the tasks to best utilize these resources.
The number of cores/tasks is a hard limit - with exceptions! Given some reasonable restrictions on how a task is coded, the compiler can silently combine several tasks onto the same core. Essentially this comes down to sequentially merging all the "setup()" parts of the tasks, and merging their "while (1) {select...}" parts into a single select statement.
Clearly a "reflective memory" serial communication channel would be set up for full duplex. Differential drivers may not be required if the link is going to another device on the same board. Yes, this is starting to look like a SPI network on steroids, so clearly it could also be used for "intelligent" I/O chips (just map their registers into the controller chip's memory space) and multiple channels could be used for devices with different latencies.