No specifics in the original post about the project? I would be interested to hear about it.
The irony is it's rather "simple". It's just a "layer down" from HA et al. It's all code based.
To be rude to it, it's a bunch of Python scripts hanging off an MQTT bus. More of the work behind it has been in avoiding code by using clearly defined processes, message structure and topic naming. 99% of the boilerplate code has been rolled up into the library and a "stub" include which even gives you your global-access stuff like the "mqtt client" and "clock". A minimum deployable service has more lines in its deployment descriptor (5) than it does code (2):
    from home_auto_lib.mqtt import client
    client.blocking_run()
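For context, the stub itself isn't magic. A stripped-down equivalent written directly against paho-mqtt would look something like the sketch below; the hostname is a placeholder and the real library rolls up far more than this (the shared clock, for one).

    # Illustrative only: the bare bones of such a "stub" on top of paho-mqtt.
    # The real home_auto_lib wraps up far more boilerplate than this.
    import paho.mqtt.client as mqtt

    BROKER = "mqtt-internal.local"   # placeholder hostname

    # Module-level client, so every service imports the same "global" handle.
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)

    def blocking_run():
        client.connect(BROKER)
        client.loop_forever()        # hand the main thread to the MQTT network loop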
The overall architecture is two MQTT buses. I call them "Prod" and "Internal". "Prod" is where any random device connects to the network. "Internal" is the service bus. On this bus all message formats and topics are normalised into a "data fabric". The concept of a "proxy" serves as the integration between these two buses. The translation a message needs in order to be accepted onto the internal service bus usually happens in the proxy components. This is probably the hottest area for work when new devices arrive in the flock. It needs a good dose of engineering; it's a mess, to be honest.
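To make the proxy idea concrete, a minimal one looks roughly like the sketch below. It's written straight against paho-mqtt so it stands alone, and the topics, hostnames and payload mapping are invented for the example, not my real scheme.

    # Sketch of a proxy: subscribe to a device's native topic on "Prod",
    # normalise the payload, republish onto the "Internal" data fabric.
    import json
    import paho.mqtt.client as mqtt

    internal = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    internal.connect("mqtt-internal.local")
    internal.loop_start()                      # background thread handles publishes

    def on_device_message(client, userdata, msg):
        raw = json.loads(msg.payload)
        normalised = {"value": raw["temp_c"], "unit": "C"}   # vendor -> fabric mapping
        internal.publish("fabric/sensor/lounge/temperature",
                         json.dumps(normalised), retain=True)

    prod = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    prod.on_message = on_device_message
    prod.connect("mqtt-prod.local")
    prod.subscribe("vendor/thermostat/+/telemetry")   # whatever the device emits natively
    prod.loop_forever()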
The first thing that jumps out at me when I go to HA is that it's a single-process, monolithic Python application. It is also a state machine with an in-memory store. Any and all "in flight" transactions are in memory. If you switch it off and on again it drops everything in flight. It attempts to handle "everyone's" problems. It manages to solve 90% of 90% of people's problems. If you buy into the ecosystem and want something in those other 10%s... good luck. My own "feeling" is that the added complexities introduced with the intent of solving someone else's problem actually hinder development at the more bespoke and lower levels.
It's not "wrong", it's just that by the time HA and others came along I had already seen the problems I will face in the future and decided to side step them before I wrote any much more code. Instead of managing the growing complexity in a big monolithic "state machine", I split everything up across the network such that no code "could" "borrow" or modify state from another. Making all processes expose their guts onto the message bus and making the message bu the source of truth, REALLY simplifies the half of it.
It is entirely routine for me to restart services rapidly while making dev changes. The system doesn't care. Very often if a bug does persist and a service fails, all the others just carry on; the specific functionality provided by that service stops, though. Usually I notice as a gap on a graph in Grafana. If it's late, I just restart the whole stack and figure it out later. This relies on the project rules of engagement that all services should be stateless and ephemeral. Shoot them in the head and a new one will spawn in its place and take over unbothered... unless it continually crashes and restarts.
I had a bug with a heating service where it took its inputs, made its choice and published its requests. It then hit a typo in a logging statement and crashed the service completely to console. It got restarted. The heating continued to function absolutely 100% without loss of functionality for 2 months last winter until I fixed it. The poor service was crashing about 2 dozen times a day.
It's not perfect, but it is full control. The library was not developed as an entity in its own right. It is code that was already peppering services, so it migrated into a "common" and the services could be split up into several packages; there are a few dozen scripts and another dozen or so support files. This is my complete bespoke setup. The library and core components are about 6 files.
There is no persistence internal to the service bus. The "expectation of volatility" rules of engagement just... side-step this entirely. Don't persist. Rebuild. There are caveats of course. For persistence, you probably guessed it, I use services. The only service that qualifies as a persistence "write" service is the "Influx Mqtt Bridge", which takes bus data and publishes it to InfluxDB. The only persistence "read" service is the "Manager" component, which is responsible for generating and publishing the master clock pulse and the static config data.
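The bridge is conceptually a one-trick service: fabric messages in, InfluxDB points out. A cut-down illustration using paho-mqtt and the influxdb-client package (URL, token, bucket and topic layout are placeholders, not my real config):

    # Cut-down "Influx Mqtt Bridge" sketch: fabric messages in, InfluxDB points out.
    import json
    import paho.mqtt.client as mqtt
    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    influx = InfluxDBClient(url="http://influx.local:8086", token="TOKEN", org="home")
    write_api = influx.write_api(write_options=SYNCHRONOUS)

    def on_message(client, userdata, msg):
        data = json.loads(msg.payload)
        point = Point("fabric").tag("topic", msg.topic).field("value", float(data["value"]))
        write_api.write(bucket="home_auto", record=point)

    bus = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    bus.on_message = on_message
    bus.connect("mqtt-internal.local")
    bus.subscribe("fabric/sensor/#")
    bus.loop_forever()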
In fairness, if you want most of the system to simply "stop", stopping that clock service will suspend most of it.
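The pulse itself is nothing exotic: the Manager publishes a tick on a well-known topic and everything else keys off that. Roughly (topic and cadence are illustrative):

    # Master clock pulse: publish a tick on a known topic; services act on the
    # tick rather than asking the OS for the time.
    import json
    import time
    import paho.mqtt.publish as publish

    while True:
        tick = {"epoch": int(time.time())}
        publish.single("fabric/clock/tick", json.dumps(tick),
                       hostname="mqtt-internal.local")
        time.sleep(60)   # one pulse a minute; stop this loop and most services pause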
EDIT:
A funny thing we do in software design, sometimes, is to make components of the system "actors" and describe their roles as if a person were describing their job. A skeleton service written to these rules is sketched after the list.
I, as a "Paul's Home Auto" service, will subscribe to data changes on the following topics of interest...
If I need config I shall subscribe to the topics that contain it and respond to updates.
I will patiently wait for that data and not act without it.
If I have incomplete or invalid data I should always default to "safe".
If I cannot do the "right" thing, I shall do the "safe" thing.
If I wish to know what time it is, I shall ask the master clock, I shall NEVER ask the OS.
When I have my data I shall make my determination and notify other services by publishing my desires on the bus.
If I fail... I shall die, to be respawned anew.
Where possible, I should "Keep calm, log it, and carry on."
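For what it's worth, the skeleton of a service written to those rules is short. This is an illustrative shape rather than a lift from my codebase; the topics and the notion of "safe" are placeholders:

    # Skeleton of an "actor" service following the rules above.
    import json
    import paho.mqtt.client as mqtt

    config = None
    latest_tick = None

    def decide_and_publish(client):
        if config is None or latest_tick is None:
            return                       # incomplete data: default to "safe" (do nothing)
        desire = {"heating": "off"}      # the "safe" choice stands in for real logic
        client.publish("fabric/service/example/desire", json.dumps(desire), retain=True)

    def on_message(client, userdata, msg):
        global config, latest_tick
        if msg.topic == "fabric/config/example":
            config = json.loads(msg.payload)        # respond to config updates
        elif msg.topic == "fabric/clock/tick":
            latest_tick = json.loads(msg.payload)   # time comes off the bus, not the OS
        decide_and_publish(client)

    bus = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    bus.on_message = on_message
    bus.connect("mqtt-internal.local")
    bus.subscribe([("fabric/config/example", 0), ("fabric/clock/tick", 0)])
    bus.loop_forever()                   # if it dies, it gets respawned and carries on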