personal experience (sorry for the long post ) :
bug 1:
3rd party embedded os that used some values stored in memory to decide if it's a cold or a warm boot:
if ( 0xDEAD == ram[0] && 0xBEEF == ram[1] ) => warm boot. ( they wanted to be "safe" )
in their case, for some reason, the " && " was actually " || " -> so the code would think it was a warm boot when only half of the condition was met. it then skipped some of its initialization steps and failed some time later, when it tried to run code from peripheral space
on some MCUs the RAM, after a cold boot, would tend to have 0xBEEF in it ... one in several thousand ...
a nice and frustrating bug ... i think i simulated 1000000 hours of run-time on all the dev boards we had until we got a production unit that failed in the factory - that board failed 5 out of 10 cold boots and we could finally find the problem.
in this case no tool would have been able to catch it ...
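to make the difference concrete, here's a minimal sketch of the two checks. this is not the vendor's code: the function names, the macro names and the idea that ram is an array of 16-bit words are my assumptions, only the two magic values come from the story above.

```c
#include <stdbool.h>
#include <stdint.h>

/* magic values from the post; the macro names are made up for this sketch */
#define WARM_MAGIC_0 0xDEADu
#define WARM_MAGIC_1 0xBEEFu

/* the shipped behavior: ONE matching word is enough to claim "warm boot",
 * so random power-up RAM content that happens to hold 0xBEEF passes */
static bool is_warm_boot_buggy(const volatile uint16_t *ram)
{
    return (WARM_MAGIC_0 == ram[0]) || (WARM_MAGIC_1 == ram[1]);
}

/* the intended behavior: BOTH words must match */
static bool is_warm_boot(const volatile uint16_t *ram)
{
    return (WARM_MAGIC_0 == ram[0]) && (WARM_MAGIC_1 == ram[1]);
}
```

with ram = { 0x1234, 0xBEEF } the buggy version reports a warm boot and the intended one does not, which is the failing board in a nutshell.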
bug 2:
CAN frame that contained a size byte and some payload.
some application code got the CAN frame, extracted the size, did some math on it, then did a memcpy from the frame buffer to an array.
well ... sometimes the CAN frame would have a size of 0 ( not documented behavior ) ... -> memcpy with size -1 ( which is a huge value once it's converted to size_t )
the kernel killed the application, but only after it corrupted some shared memory that was mapped in the application's address space and used as a very fast ipc.
=> the code that handled the ipc (mine) was the one that started complaining that someone corrupted its data => "it's your bug, not mine"
guess who had to spend a "lovely" evening replaying hour-long CAN traces to reproduce / figure out what CAN frame was generating the issue and then finding out where it was failing ...
in this case :
- basic defensive programming would have fixed it from the start, instead of "i'm a high level C++ linux application programmer, input data is always right so the bug is not in my team's code" ( you know the speech )
- Ada might have helped ... if the client actually gave a damn about tools and dev time lost to digging out stupid errors
- review - only if the reviewers have the proper mindset
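the kind of "basic defensive programming" i mean is roughly this. it's only a sketch: the frame layout ( first byte = size, counting itself plus the payload ), the function name and the return convention are my assumptions, the real application did different math on the size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAN_MAX_DATA 8u /* data field length of a classic CAN frame */

/* hypothetical layout: data[0] = size ( itself + payload ), data[1..] = payload.
 * returns the number of payload bytes copied, or -1 if the frame is rejected. */
static int extract_payload(const uint8_t *data, size_t dlc,
                           uint8_t *out, size_t out_cap)
{
    if (data == NULL || out == NULL || dlc == 0u || dlc > CAN_MAX_DATA) {
        return -1;
    }

    size_t size = data[0];

    /* size == 0 is the undocumented case: size - 1 would wrap around to
     * SIZE_MAX and memcpy would run far past both buffers */
    if (size == 0u || size > dlc) {
        return -1;
    }

    size_t payload_len = size - 1u;
    if (payload_len > out_cap) {
        return -1;
    }

    memcpy(out, &data[1], payload_len);
    return (int)payload_len;
}
```

three cheap comparisons before the memcpy, and the bad frame gets dropped ( and hopefully logged ) instead of taking the shared memory down with it.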
bug 3:
a simple buffer - a byte array that stored data until all of the chips on the board were up and running.
on some very rare occasions - like 1 in 10000 cycles - one of the chips took way longer than expected and the data rate on the input bus spiked -> the buffer overflowed by 1 byte.
just enough to overwrite some stuff from a 3rd party eeprom emulation library -> the library happily trashed half of the attached flash memory.
on the next boot you had no more system settings or identification data ... no complete failure but you lost some important functionality
again, caught in the field and reproduced after days of repetitive testing - it's a joy to work on systems that are not predictable
- review ... maybe, the buffer was already 2x the worst case we could estimate
- testing ... you end up having to simulate the entire solar system if you need to be 100% sure
- defensive programming : yes, if you constantly enforce it.
- run-time range checks - certainly yes - the cost of lost time + updating all devices that are already in the field might be greater than a full Ada license + training + a round-the-world, 5-star, company-paid cruise for the team that found the bug
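in C you have to write the range check by hand, something like the sketch below. the struct, the names and the 64 byte size are made up for illustration, the real buffer looked different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STARTUP_BUF_SIZE 64u /* hypothetical size, not the real one */

typedef struct {
    uint8_t data[STARTUP_BUF_SIZE];
    size_t  count;   /* bytes currently stored */
    size_t  dropped; /* bytes rejected because the buffer was full */
} startup_buf_t;

/* stores one byte; returns false ( and counts the drop ) when the buffer
 * is full, instead of writing past the end of data[] */
static bool startup_buf_push(startup_buf_t *b, uint8_t byte)
{
    if (b->count >= STARTUP_BUF_SIZE) {
        b->dropped++;
        return false;
    }
    b->data[b->count++] = byte;
    return true;
}
```

you still lose the extra byte, but you lose it in a place you control and you get a counter that tells you it happened, instead of a neighbor's data getting overwritten.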
=>
there are no magical solutions.
you can make horrible code with the best tools if you rush it and don't give a damn about the result.
or you can make an almost perfect system with assembler and a piece of string as a debugger, if you work on it your entire life.
also, there's the question of cost : how expensive is this failure ?
if your toyota decides to go full throttle for no apparent reason you get an ugly accident with several victims
if your rush-hour filled to the brim subway decides to go full throttle and "extend the tunnel" ... you'll need more than a mop and a bucket to clear up the mess.
i've been in a subway train that had a software crash - the doors were no longer responding, the mechanic manually opened some of them via the emergency release.
he had to power cycle the train to get it going again.
regarding rail-road relays:
- they are "gravitational relays" - there is no spring to pull the armature into the off position - the weight of the armature does this, so they work only in a specific orientation
- the contacts are made from materials that do not weld - graphite and silver for example, so they'll disconnect even if the contact points are white-hot