Their current FPPA paradigm is not very compatible with C. It looks like they designed a non-parallel instruction set, and then just made some parallel devices.
SDCC currently does not support multiple FPPA.
For efficient parallelism for the Padauk µC, we'd need:
* Stack-pointer-relative addressing and better stack handling (I'm currently running some experiments: when you compile code with --stack-auto for the current architecture, code size triples)
* Synchronization. We would need at least a compare-and-swap instruction, and an xch with an indirect addressing mode. Preferably also an idxadd variant of add M, a.
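To illustrate why these primitives matter, here is a minimal sketch in portable C11 for a hosted compiler, not Padauk code. It assumes `<stdatomic.h>` is available; the names `cas_add`, `lock_acquire` and `lock_release` are made up for this example. The retry loop is the kind of thing a compare-and-swap instruction would make possible, and the spinlock is what an atomic exchange would give us.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Counter shared between several hardware threads (illustrative only). */
_Atomic uint8_t shared_counter = 0;

/* Add delta to *target with a compare-and-swap retry loop.
 * If another thread changes *target between the load and the CAS,
 * the CAS fails, 'expected' is refreshed, and we try again. */
uint8_t cas_add(_Atomic uint8_t *target, uint8_t delta)
{
    uint8_t expected = atomic_load(target);
    uint8_t desired;
    do {
        desired = (uint8_t)(expected + delta);
    } while (!atomic_compare_exchange_weak(target, &expected, desired));
    return desired;
}

/* Spinlock built on an atomic exchange (test-and-set). */
atomic_flag demo_lock = ATOMIC_FLAG_INIT;

void lock_acquire(atomic_flag *l)
{
    while (atomic_flag_test_and_set(l)) {
        /* busy-wait until the previous owner clears the flag */
    }
}

void lock_release(atomic_flag *l)
{
    atomic_flag_clear(l);
}
```

Without such instructions, the compiler would have to emulate each of these operations with longer sequences, which is part of why the generated code would be big and slow.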
I believe I can add support for multiple FPPA to SDCC. But the generated code will be big and slow. For practical purposes, running a multitasking OS on a single FPPA would be more efficient.
I don't think true C-compatible parallelism was their ultimate goal. (They don't even provide a C compiler...) Obviously that would have added a lot of additional complexity, as you point out.
The idea is probably to have specific tasks assigned to specific threads. For example, one thread runs the SPI peripheral, a second one runs a control loop, while the main thread mostly sleeps and does housekeeping/reconfiguration. Each of these would reside in its own memory space with dedicated resources and would use minimal inter-thread communication via pipes. This would keep synchronization overhead down a lot.
For SDCC, the main challenge would be to allow multiple main functions, one to initialize each FPPA. Maybe that could be done similarly to interrupts? It would be up to the user to ensure that variables are not reused between different threads.
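As a rough sketch of what such a pipe could look like, here is a single-producer, single-consumer ring buffer in portable C11. This is host-side illustration, not tested on a Padauk device; `pipe_t`, `pipe_put`, `pipe_get` and `PIPE_SIZE` are names invented for the example. Because each index is written by only one side, no compare-and-swap is needed, only single-byte loads and stores.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PIPE_SIZE 8u /* must be a power of two; usable capacity is PIPE_SIZE - 1 */

typedef struct {
    uint8_t buf[PIPE_SIZE];
    _Atomic uint8_t head; /* written only by the producer thread */
    _Atomic uint8_t tail; /* written only by the consumer thread */
} pipe_t;

/* Producer side: returns false if the pipe is full. */
bool pipe_put(pipe_t *p, uint8_t byte)
{
    uint8_t head = atomic_load(&p->head);
    uint8_t next = (uint8_t)((head + 1u) & (PIPE_SIZE - 1u));
    if (next == atomic_load(&p->tail))
        return false;             /* full: one slot is kept free */
    p->buf[head] = byte;          /* store the data first ... */
    atomic_store(&p->head, next); /* ... then publish it */
    return true;
}

/* Consumer side: returns false if the pipe is empty. */
bool pipe_get(pipe_t *p, uint8_t *out)
{
    uint8_t tail = atomic_load(&p->tail);
    if (tail == atomic_load(&p->head))
        return false;             /* empty */
    *out = p->buf[tail];
    atomic_store(&p->tail, (uint8_t)((tail + 1u) & (PIPE_SIZE - 1u)));
    return true;
}
```

In the scenario above, the SPI thread would call `pipe_put` and the control loop would call `pipe_get`, with everything else kept private to each thread.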
Of course, something needs to be done with the p-register. A very rigid approach would be to clearly assign each function to only one FPPA.
A more pragmatic approach, for now, would be to have C code always stay on FPPA0 and use assembler for the others...
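To make the "multiple main functions" idea a bit more concrete, here is a purely hypothetical sketch in plain C. `FPPA_ENTRY` is not an SDCC keyword; it is an empty macro standing in for whatever annotation the compiler might accept, in the spirit of how interrupt handlers are marked. `NUM_FPPA`, `fppa_entry_table` and `start_fppa` are also invented, and on a host the "start" is just a function call through a table.

```c
#include <stdint.h>

#define NUM_FPPA 2u       /* assumption: a two-FPPA device */

/* Hypothetical annotation: binds a function to exactly one FPPA.
 * It expands to nothing here so the sketch compiles as ordinary C. */
#define FPPA_ENTRY(n)

typedef void (*fppa_entry_t)(void);

/* Each variable is owned by exactly one FPPA; the user (or, ideally,
 * the compiler) must make sure nothing is shared by accident. */
uint8_t housekeeping_ticks; /* touched by FPPA0 only */
uint8_t spi_bytes_sent;     /* touched by FPPA1 only */

FPPA_ENTRY(0) void fppa0_main(void)
{
    housekeeping_ticks++;   /* placeholder for housekeeping work */
}

FPPA_ENTRY(1) void fppa1_main(void)
{
    spi_bytes_sent++;       /* placeholder for the SPI loop */
}

/* Entry table, analogous to an interrupt vector table. */
const fppa_entry_t fppa_entry_table[NUM_FPPA] = { fppa0_main, fppa1_main };

/* Host-side stand-in for however the hardware starts each unit. */
void start_fppa(uint8_t n)
{
    if (n < NUM_FPPA)
        fppa_entry_table[n]();
}
```

With a rigid one-function-per-FPPA rule like this, the compiler could in principle give each entry point its own p-register and its own set of temporaries.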
Regarding preemptive multitasking: Of course this would remove some of the synchronization headache, but then the usefulness would be quite limited compared to using multiple hardware threads.