I don't think true c-compatible paralellism was their ultimate goal. (They don't even provide a C-compiler...). Obviously that would have added a lot of additional complexity, as you point out.
Still, on such a "multicore" µC, you want efficient communication between multiple cores and interrupt handlers, which means efficient lockless atomics, which means a compare-and-swap instruction, preferably with indirect addressing mode.
The idea is probably to have specific tasks assigned to specific threads. For example one thread runs the SPI peripheral, a second one runs a control loop, while the main thread mostly sleeps and does housekeeping/reconfiguration. Each of these would reside in their own memory space with dedicated ressources and would use minimal inter-thread communication via pipes. This would keep synchronization overhead down a lot.
For SDCC, the main challenge would be to allow multiple main function to initialize each FPPA. Maybe that could be dont similar to interrupts? It would be up to the user to ensure that variables are not reused between different threads.
Of course, something needs to be done with the p-register. A very rigid approach would be to clearly assign each function to only one FPPA.
A more pragmatic approach, for now, would be to have C-code always stay on FPPA0 and use assembler for the others...
Regarding preemtive multitasking: Of course this would free up some of the synchronization headache, but then the usefulness would be quite limited compared to using multiple hardware threads.
SDCC is a C compiler, and will aim to comply with the C standard. Where that can't be done efficiently, we still need to be able to comply, but offers alternatives. E.g. when compiling for Padauk, functions are currently not reentrant by default (but reentrancy can be enabled for individual functions using __reentrant, or for the whole program using --stack-auto; it about triples code size though).
For multiple FPPA, my idea would be:
By default make the code work on any FPPA. That means a lock around any use of p. And no longer using p to pass return values (thus 16-bit return values would have to be passed on the stack, like we currently do for return values > 16 bit). Inefficient, but avoids nasty surprises.
Then provide a way for the user to specify that a function is to be used on a specific FPPA only, e.g. something like
[[sdcc::fppa(3)]] void f(int)
{
…
}
An FPPA3-specific p3 would be used instead of p when generating code for f.