I really miss working on low-budget space electronics. Talk about writing software for an all-too-real world that's out of this world!
Using distributed microcontrollers really is the way to go for far more than just power management reasons.
First, it is important to know that "rad-hard" does NOT mean "radiation proof"! Only shielding can make anything rad-proof, the more the merrier, except when you start needing a larger booster to lift the extra mass. So most space electronics uses only the absolute minimum shielding in the fewest possible places to get the "weaker" components to have lifetimes that are long enough. And that shielding is optimized to use the lowest mass of material that provides the minimum needed radiation stopping power: Oddly enough, lead is rarely used, if ever.
Most space electronics (all of it on nano-budget systems) is pretty much "hanging in the radiation breeze".
All electronics in space will exhibit significant rates of "hits" from radiation (there's no easy way to shield against cosmic rays). There are several flavors of hits, with different physics within the chip depending on what gets hit (a PN junction, an insulation layer, the metallization, or any combination), and with different effects. The effects range from flipping a bit for an instant, to getting a bit "stuck", all the way to creating a dead short between VCC and GND. There are two basic mechanisms available to detect hits, and just one primary way to recover from them.
The simplest detection method is a hardware watchdog timer: If a hit in any way affects how the main software runs, eventually it will either "go insane" or be detected by the BIST (Built-In Self-Test) code. Either will cause the watchdog timer to trip, thus resetting the processor. However, in space we always cycle power, and never rely on the reset logic (which can itself get hits).
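To make the watchdog idea concrete, here is a minimal simulation, written in Python purely for readability (real flight code would live on the microcontroller, and the watchdog itself is hardware). The `Watchdog` class, `run_loop`, and the 30 ms timeout are illustrative assumptions, not details from any actual system: the loop only "kicks" the watchdog when BIST passes, so a failed self-test or a hung loop both end in a trip.

```python
class Watchdog:
    """Simulated hardware watchdog: trips if not kicked within timeout_ms."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.elapsed_ms = 0
        self.tripped = False

    def kick(self):
        # The main loop calls this only after BIST has passed.
        self.elapsed_ms = 0

    def tick(self, ms):
        # Advance simulated time; trip once the timeout is reached.
        self.elapsed_ms += ms
        if self.elapsed_ms >= self.timeout_ms:
            self.tripped = True
        return self.tripped


def run_loop(watchdog, bist_results, loop_ms=10):
    """Run one loop pass per BIST result.

    Returns the index of the pass on which the watchdog tripped,
    or None if it never tripped.
    """
    for i, bist_ok in enumerate(bist_results):
        if bist_ok:
            watchdog.kick()
        if watchdog.tick(loop_ms):
            return i  # per the text above: cycle power here, don't just reset
    return None
```

With a 30 ms timeout and a 10 ms loop, three consecutive BIST failures are enough to trip the simulated watchdog.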
The other method (the most important one) is to monitor the current used. It is easy to see when a short is occurring, and to temporarily cut power long before the chip sees any significant thermal rise or damage. Radiation deposits energy into places in the silicon where it isn't wanted, and turning off power for a short while is the best way to let that energy dissipate and the damage done to "heal".
So, what makes a circuit "rad-hard"? Basically, it is the total dose below which the chip will continue to recover from hits. All chips in a radiation field will eventually fail: Rad-hard chips simply take longer to reach the point where they no longer recover after a power cycle. Both COTS and rad-hard chips will see similar radiation effects while in use: You pay extra to get chips that can endure it longer.
What you have in space are chips that will randomly freak out, especially the processors and memory. They do this as a normal part of operating in a radiation field. This is where the software design gets fun. How do you cope with a processor that may randomly get its power cycled when passing through the Van Allen Belts?
First, you need fast detection, so you don't try to operate with a failing chip. A current detector will typically trip within 1-5ms, and the BIST loop operates on a 10ms-100ms cycle. The power-down time after a hit is typically 10ms-100ms. In one system I worked on, we used a 25ms off time, but with rapid re-detection to go right back into power off up to three more times (100ms total) if the hit hadn't resolved.
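The 25ms off time with up to three repeats can be sketched as a small retry loop. This is a simulation under assumptions, not the original implementation: `recover_from_hit` and the `hit_still_present` callback are hypothetical names, and the callback stands in for the current detector re-tripping after power is restored.

```python
def recover_from_hit(hit_still_present, off_ms=25, max_cycles=4):
    """Cut power for off_ms, re-check, and repeat up to max_cycles times.

    hit_still_present: callable taking the 1-based cycle number and
    returning True if the over-current is re-detected once power is back.
    Returns (recovered, total_off_ms).
    """
    total_off_ms = 0
    for cycle in range(1, max_cycles + 1):
        total_off_ms += off_ms           # power held off for one interval
        if not hit_still_present(cycle):
            return True, total_off_ms    # current draw is back to normal
    return False, total_off_ms           # hit never cleared; escalate


# A hit that clears after the third power-off interval:
recovered, off_time = recover_from_hit(lambda cycle: cycle < 3)
```

With the defaults, the worst case is four intervals of 25ms, which matches the 100ms total mentioned above.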
What next? Well, since your mission software is not running during a hit, it is vital to get back to full operation as fast as possible. And that means booting must be as close to instantaneous as possible, but it must include checking the critical parts of RAM and FLASH for errors. Of course, you have no idea when the next hit will arrive, so it is also vital that the most important code run first, as soon as the startup code has verified the processor and memory work "well enough" to run the most critical computations.
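As a rough illustration of "check the critical memory, then run the most important code first", here is a sketch in Python. Everything named here is an assumption made for the example: the `boot` function, the region list, the task names, and the use of CRC-32 (via the standard `zlib.crc32`) as the integrity check. A real boot loader would pick whatever check fits its memory and time budget.

```python
import zlib


def region_ok(data: bytes, expected_crc: int) -> bool:
    """Compare a memory region's CRC-32 against the stored value."""
    return zlib.crc32(data) == expected_crc


def boot(regions, tasks):
    """Verify only the critical regions, then run tasks most-critical first.

    regions: list of (name, data, expected_crc) tuples.
    tasks:   list of (priority, callable); a lower number is more critical.
    Returns the task results in the order run, or None if a check failed.
    """
    for _name, data, expected_crc in regions:
        if not region_ok(data, expected_crc):
            return None  # not "well enough": power-cycle again or fall back
    results = []
    for _priority, task in sorted(tasks, key=lambda t: t[0]):
        results.append(task())
    return results


# Hypothetical image and tasks, for illustration only.
image = b"critical code image"
regions = [("flash", image, zlib.crc32(image))]
tasks = [(2, lambda: "telemetry"), (0, lambda: "attitude"), (1, lambda: "power")]
order = boot(regions, tasks)
```

The point of the ordering is that if the next hit arrives mid-boot, the lowest-priority work is what gets lost.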
But, perversely, even though you are running the most important code first and quickly, it must not need to run all that often, say at 1 Hz or so, so that the system will perform correctly even when in the most intense radiation fields.
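A back-of-the-envelope budget shows why a low required rate helps. The sketch below assumes power cycles are evenly spaced, which real radiation hits are not, and the function names, the 5 ms boot time, and the example rates are all made-up illustration values.

```python
def run_time_per_second_ms(cycles_per_s, off_ms, boot_ms):
    """Milliseconds per second left for mission code after power
    cycling and rebooting, on average."""
    lost_ms = cycles_per_s * (off_ms + boot_ms)
    return max(0.0, 1000.0 - lost_ms)


def fits_between_cycles(cycles_per_s, off_ms, boot_ms, task_ms):
    """True if the critical task fits in the average uninterrupted
    window between two power cycles (evenly spaced hits assumed)."""
    if cycles_per_s == 0:
        return task_ms <= 1000.0
    window_ms = 1000.0 / cycles_per_s - (off_ms + boot_ms)
    return task_ms <= window_ms


# Example: 5 power cycles per second, 25 ms off, 5 ms boot (assumed values).
budget_ms = run_time_per_second_ms(5, 25, 5)
short_task_ok = fits_between_cycles(5, 25, 5, task_ms=100)
long_task_ok = fits_between_cycles(5, 25, 5, task_ms=200)
```

Under these assumptions a short critical task that only has to succeed about once per second has several windows to do so, while a long-running one may never finish.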
It should be clear that there is no time to boot a conventional OS to run critical functions: While many space systems do use POSIX-like RTOSes, they are run on processors that do lots of relatively unimportant work (such as bulk data processing). The most vital tasks are generally farmed out to microcontrollers.
There is another important reason to use microcontrollers, other than for isolating vital software routines: Radiation incidents are proportional to the silicon area. A general-purpose processor (GPP) can easily use a square inch of silicon for the processor, flash and RAM. A typical microcontroller uses about 1-5% of that area. So you would expect a GPP to see hit rates that are 20x to 100x those seen by a microcontroller, just based on area alone.
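The area argument is simple enough to check by hand. The sketch below assumes hits scale linearly with silicon area and nothing else (`relative_hit_rate` is a name made up for this example): a microcontroller at 5% of the GPP's area gives the 20x figure, and one at 1% gives 100x.

```python
def relative_hit_rate(mcu_area_fraction):
    """Hit rate of the GPP relative to a microcontroller, assuming
    hits are proportional to silicon area and nothing else."""
    return 1.0 / mcu_area_fraction


for fraction in (0.05, 0.01):
    ratio = relative_hit_rate(fraction)
    print(f"MCU at {fraction:.0%} of the GPP area: GPP sees about {ratio:.0f}x the hits")
```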
So, would you care to take a guess as to how much shielding is needed to improve the hit rate for a GPP to match that of a microcontroller? As much as 30% of the mass in some commercial communication satellites is shielding. But when you add shielding, you need to make the station-keeping motor more powerful and store more fuel, and the satellite frame (called the "bus") also has to get larger and stronger. Which is why some satellites weigh as much as a school bus, instead of a Fiat.
Now let's step back and look at the big picture. Let's say you can't afford ANY specialty rad-hard components, yet you have to pass through the Van Allen belts to get where you are going (such as to the moon or GEO). What do you do?
First, you keep as much as possible TURNED OFF when passing through high radiation fields. Especially the GPPs.
Second, you go through the high radiation areas as fast as possible (minimize time): Pair the smallest possible satellite with the biggest booster you can afford and the straightest course you can handle.
Third, you add redundancy: Have multiple redundant processors for the most important tasks. If you do it right, you use voting circuits that are implemented using discrete or "jelly-bean" logic. Some of these simple devices are inherently rad-hard, so careful circuit design (and shopping) pays off!
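The voting itself is done in hardware, as noted above, but the logic function is easy to show in software. This is a generic 2-of-3 bitwise majority vote, offered as an illustration rather than as the circuit used in any particular system; the function names are invented for the example.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit agrees with at least
    two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing(a: int, b: int, c: int):
    """Return the indices of the inputs that differ from the voted result."""
    winner = vote(a, b, c)
    return [i for i, value in enumerate((a, b, c)) if value != winner]


# Third processor took a hit that flipped one bit of its output:
result = vote(0b1010, 0b1010, 0b0010)
suspects = disagreeing(0b1010, 0b1010, 0b0010)
```

In gates this is three ANDs feeding an OR per bit, which is the kind of function that fits in simple "jelly-bean" logic. Knowing which input disagreed also tells you which processor to power cycle.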
Fourth, make the software in each microcontroller "run light without overbyte". One system I built was able to reliably perform its critical functions in a radiation environment that caused 5 power cycles per second! And that system had microcontrollers clocked at only 4 MHz. I would love to do that project again using 20 MHz AVRs!
-BobC