Author Topic: CAN bus and others, how are really garbled transmissions protected against  (Read 1442 times)


Offline InfravioletTopic starter

  • Super Contributor
  • ***
  • Posts: 1066
  • Country: gb
So, let's consider a CAN bus or another bus protocol running some big moving piece of equipment that can damage itself if it is given the wrong command at the wrong time.

What is typically done in this sort of scenario to protect against garbled transmissions?

Putting checksum bytes, parity bits, or error-correction bits into the transferred data can obviously eliminate most cases of slight corruption. BUT what if you get a situation where a message is so corrupted that the received message ends up with the perfect storm of corruption in both the message and the checksum, such that it looks like a valid message again, except now the valid message is telling the receiving device to do something it really shouldn't do at this time?

Say a message on some arbitrary bus is sent like:
1,255,check_byte
meaning move yourself in direction 1 (fwds) at speed 255/255
but noise on the line(s) means what arrives is
0(corrupted),255,check_byte(also_corrupted)
and the corruption is the "perfect storm" where the corrupted checksum is valid for the corrupted message. Now the thing is moving in direction 0 and about to tear itself apart, because it should only move in direction 0 at half speed, and at a different time when something else has gotten out of the way...
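To make that concrete, here is a sketch with a hypothetical 1-byte XOR checksum (not any particular bus protocol), where two correlated bit flips turn one valid frame into a different, still-valid frame:

```python
def xor_checksum(data: bytes) -> int:
    c = 0
    for b in data:
        c ^= b
    return c

# intended frame: direction=1 (fwds), speed=255, plus checksum byte
msg = bytes([1, 255])
frame = msg + bytes([xor_checksum(msg)])

# noise flips the same bit in the direction byte AND the checksum byte
corrupted = bytes([frame[0] ^ 0x01, frame[1], frame[2] ^ 0x01])

# the corrupted frame still passes the check, but now direction=0
assert xor_checksum(corrupted[:2]) == corrupted[2]
assert corrupted[0] == 0
```

A longer CRC makes such compensating flips astronomically less likely, but never impossible, which is exactly the question.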

Is there a standard method used to protect against these perfect-storm corruptions? I ask about CAN bus because, being automotive, you'd expect any errors to have pretty damaging consequences, so those who make regular use of it would have good motivation to work out a way to handle those perfectly terrible errors. The same goes for communication methods used inside avionics systems, factory automation, controlling robotic arms...

Or for other buses in smaller systems such as I2C, you could also get situations where the corruption affects the addressing, so a valid message intended for one device ends up being read by some other device on the bus, which interprets the same message in a wholly different way...

Do people developing those sorts of systems have some way to actually eliminate the possibility of a corrupted transmission looking like a valid message with a different meaning than intended? Do they just work by procedures of sending the message several times before an action is taken? Do they just tolerate this possibility but drive the chance of it down to some once-in-a-million-years threshold by using sufficiently many bits or bytes of error coding per bit or byte of actual message data?

Thanks
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11420
  • Country: us
    • Personal site
CAN also checks signal levels outside of the sampling points. Plus both the receiving and transmitting sides monitor the bus, and if the TX side detects that the line level does not match what it is trying to transmit, it stops the transmission and intentionally generates an error signal.

Additional protection is provided by the fact that nobody relies on guaranteed delivery: messages are sent periodically, with a period much smaller than a typical reaction time. So even if by some miracle a sensor value gets damaged, it will be updated to the correct value within the next few ms. Not many components can react in that time.

I2C does something similar, although much less robust: when a master detects that the line level does not match what it wants to drive, it assumes a conflict and stops the transfer.
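That transmit-and-monitor idea can be sketched like this (the `bus_sample` callback is a hypothetical stand-in for reading the bus level back after driving each bit):

```python
def drive_and_check(wanted_bits, bus_sample):
    # After driving each bit, read the bus level back; a mismatch means
    # another node is driving the line (or the wire is faulted), so abort.
    for i, bit in enumerate(wanted_bits):
        if bus_sample(i, bit) != bit:
            return False   # conflict detected: stop the transfer
    return True

healthy = lambda i, bit: bit                      # bus follows the driver
stuck_low = lambda i, bit: 0 if i == 2 else bit   # third bit forced low

assert drive_and_check([1, 0, 1, 1], healthy)
assert not drive_and_check([1, 0, 1, 1], stuck_low)
```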
« Last Edit: October 27, 2023, 05:01:35 pm by ataradov »
Alex
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8271
  • Country: fi
In the end, it is about managing probabilities. An infinite number of monkeys will eventually type out perfect copies of all of Shakespeare's work, and likewise, given infinite time, random noise will at some point generate a CAN packet that is valid (data, CRC, everything) and instructs the lathe to kill its operator. It is just unlikely enough, and this is something that is not too hard to ballpark even on a napkin. If it requires a billion machines running for a billion years, then you can maybe accept the risk.

CAN is resilient against this because it uses a pretty long CRC compared to the packet payload length. It is practically impossible to get a garbled CAN message through the CRC check. A dangerous bug in the CAN code (including the HW peripheral) is far more likely, by dozens of orders of magnitude, as is memory corruption, even with ECC.

It is important to understand that absolutely nothing is zero-risk, and to concentrate on the highest-risk items. These are usually, in this order: human errors in software code, then human errors in hardware design, then maybe bit errors in non-checksummed RAM and in non-checksummed simple communication interfaces (think I2C), then bit errors in parity-checked RAM or interfaces. Strongly checksummed data (even with just an 8-bit CRC) is much further down the list.
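That napkin ballpark can be sketched like this (all of the numbers below are illustrative assumptions, not measured CAN figures):

```python
from math import comb

ber = 1e-6           # assumed bit error rate on a healthy bus
bits = 100           # rough CAN frame length in bits
frames_per_s = 1e4   # assumed bus load

# CAN's 15-bit CRC detects all patterns of up to 5 bit errors for typical
# frame lengths, so only frames with >= 6 independent bit errors can
# possibly slip through, and of those roughly 2**-15 match the CRC by chance.
p_six_errors = comb(bits, 6) * ber**6
p_undetected = p_six_errors * 2**-15
mean_years = 1 / (p_undetected * frames_per_s) / (3600 * 24 * 365)
# mean_years comes out in the 1e19-1e20 range: literally "a billion
# machines running for a billion years" territory
```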
« Last Edit: October 27, 2023, 06:07:27 pm by Siwastaja »
 

Offline pdenisowski

  • Frequent Contributor
  • **
  • Posts: 752
  • Country: us
  • Product Management Engineer, Rohde & Schwarz
    • Test and Measurement Fundamentals Playlist on the R&S YouTube channel

In a former life I spent a lot of time working with various communications protocols.

  • If you want to avoid errored messages that appear to be valid, you implement a robust (long) checksum.
  • If you want to avoid out-of-sequence or missing messages, you implement a sequence number (also protected by a checksum).

But you still can't, even theoretically, drive the error probability down to zero.
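Both ideas together can be sketched like this (hypothetical frame layout, using CRC-32 purely for illustration):

```python
import struct
import zlib

def make_frame(seq: int, payload: bytes) -> bytes:
    # sequence number and payload are both covered by the CRC
    body = struct.pack(">I", seq) + payload
    return body + struct.pack(">I", zlib.crc32(body))

def parse_frame(frame: bytes, expected_seq: int):
    body, (crc,) = frame[:-4], struct.unpack(">I", frame[-4:])
    if zlib.crc32(body) != crc:
        return None                  # corrupted frame
    seq = struct.unpack(">I", body[:4])[0]
    if seq != expected_seq:
        return None                  # missing or out-of-sequence message
    return body[4:]

f = make_frame(7, b"\x01\xff")
assert parse_frame(f, 7) == b"\x01\xff"
assert parse_frame(f, 8) is None                            # wrong sequence
assert parse_frame(bytes([f[0] ^ 0x10]) + f[1:], 7) is None  # bit flip caught
```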

I had a case once where a data link appeared to have almost entirely failed:  only about one in several million transmitted messages was being "properly" received.

When I put an analyzer on the link, I noticed that messages were being corrupted because the 3rd bit in every byte (including the checksum bytes) was always being set to zero during transmission due to a hardware fault. 

But every now and then, either a packet would naturally have a zero as the 3rd bit of every byte (and survive transmission) OR, much more disturbingly, a packet would have that bit corrupted in every message byte and the corrupted checksum would be valid for the corrupted packet (!!!)

This was a very high-speed data link (I think it was an OC-192 connection) and it took hours to occur, but it is anecdotal "proof" that things like this do happen in the real world.
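That failure mode is easy to reproduce in miniature. Here is a sketch with a hypothetical 2-byte payload, an 8-bit additive checksum, and a fault that forces bit 2 of every byte (including the checksum) to zero; a brute-force scan shows that most frames fail the check after the fault, but some corrupted frames still validate:

```python
def sum_check(data):
    return sum(data) & 0xFF

FAULT_MASK = 0xFB     # hardware fault: bit 2 stuck at zero in every byte

passed = corrupted_but_valid = 0
for d1 in range(256):
    for d2 in range(256):
        frame = [d1, d2, sum_check([d1, d2])]
        faulted = [b & FAULT_MASK for b in frame]
        if sum_check(faulted[:2]) == faulted[2]:
            passed += 1
            if faulted[:2] != frame[:2]:
                corrupted_but_valid += 1

# most frames are rejected after the fault, but some corrupted
# frames arrive with a checksum that is valid for the corrupted data
assert 0 < passed < 256 * 256
assert corrupted_but_valid > 0
```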
Test and Measurement Fundamentals video series on the Rohde & Schwarz YouTube channel:  https://www.youtube.com/playlist?list=PLKxVoO5jUTlvsVtDcqrVn0ybqBVlLj2z8
 

Offline zilp

  • Regular Contributor
  • *
  • Posts: 216
  • Country: de
The fundamental problem that you can't reduce the risk to zero has already been pointed out.

However, you *can* still calculate probabilities based on a noise model and properties of the redundancy mechanisms that you employ to detect (and potentially correct) transmission errors, and you can select those mechanisms to achieve a target failure rate.

As for CRCs in particular, this page can be helpful:

https://users.ece.cmu.edu/~koopman/crc/

If it really matters and you have the computational power, you can also use cryptographic hashes; that drives the probability of accidental corruption down to a level where all other sources of failure dwarf the remaining risk. If you add a secret to make it a MAC, you can even protect against intentional interference, which becomes important when you have an untrusted communication medium (radio, shared networks, internet, ...).

Also, mind that you have to protect everything that is semantically relevant, so things like message-sequence completeness may be just as important as message integrity.
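A minimal sketch of the MAC idea using Python's standard library (the key and messages here are placeholders; in practice the key would be provisioned out of band):

```python
import hashlib
import hmac

KEY = b"pre-shared-secret"   # hypothetical key

def protect(msg: bytes) -> bytes:
    # append an HMAC-SHA256 tag over the whole message
    return msg + hmac.new(KEY, msg, hashlib.sha256).digest()

def verify(frame: bytes):
    msg, tag = frame[:-32], frame[-32:]
    expected = hmac.new(KEY, msg, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return None           # corrupted or forged
    return msg

f = protect(b"dir=0 speed=128")
assert verify(f) == b"dir=0 speed=128"
assert verify(b"X" + f[1:]) is None   # any tampering is rejected
```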
 

Offline pickle9000

  • Super Contributor
  • ***
  • Posts: 2439
  • Country: ca
There are also life-critical situations: flood gates on a dam, flight control systems. In those systems they often use redundant communication lines, redundant computers, and even multiple programming teams to reduce errors.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3784
  • Country: us
A long checksum can make the probability of a random error being accepted arbitrarily low; for instance, a 32-bit checksum over random data will only be accidentally correct about one time in 4 billion.

Another part is choosing the right checksum for your error model.  Some checksums are weak against common errors such as transpositions, repetitions, or extra zero words at the end.  Some codes have error-correcting capability; that's important if losing a message can also be dangerous.

You also have to consider what happens if your link goes down completely.  Your checksums will hopefully ensure that line noise isn't misinterpreted as data, but losing messages can also be dangerous if the message is "emergency stop".  That's another reason CAN messages are often repeated constantly: if a safety-critical receiver doesn't receive a message within a certain amount of time, it can go into a fail-safe mode.
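A sketch of that fail-safe timeout on the receiver side (the deadline and rebroadcast period are assumed figures):

```python
import time

class FailSafeReceiver:
    TIMEOUT_S = 0.1   # assumed: sender rebroadcasts every ~10 ms

    def __init__(self):
        self.last_rx = time.monotonic()
        self.speed = 0

    def on_message(self, speed: int) -> None:
        self.last_rx = time.monotonic()
        self.speed = speed

    def current_speed(self) -> int:
        # no fresh command within the deadline: assume the link is down
        if time.monotonic() - self.last_rx > self.TIMEOUT_S:
            self.speed = 0            # fail safe: stop
        return self.speed

rx = FailSafeReceiver()
rx.on_message(200)
assert rx.current_speed() == 200
rx.last_rx -= 1.0                     # simulate a link silent for 1 s
assert rx.current_speed() == 0
```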

Also, pick your command set carefully.  Don't have a command that toggles some state (like the power button on a TV), because if you lose track of the current state you will do the wrong thing.  Try to have messages declare the desired state as completely as practical.  The goal is for a message not to depend on the previous message having been executed properly.
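The toggle-versus-declare point in miniature (hypothetical handlers):

```python
# Fragile: the effect depends on state the sender may have lost track of
def handle_toggle(current_on: bool) -> bool:
    return not current_on

# Robust: the message carries the complete desired state
def handle_set(desired_on: bool, desired_speed: int) -> tuple:
    return (desired_on, desired_speed)

# A toggle accidentally delivered twice silently undoes itself;
# delivering the same "set" command twice is harmless (idempotent).
assert handle_toggle(handle_toggle(False)) == False
assert handle_set(True, 128) == handle_set(True, 128) == (True, 128)
```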
 

Offline MarginallyStable

  • Regular Contributor
  • *
  • Posts: 68
  • Country: us
Another approach used in safety-critical systems is to model the system in question and compare input values with what is expected. Example: I command a heater on, and from the system model we expect the temperature to rise 1 deg per minute. If we get a single bad message where it suddenly jumps by 10 deg, we can throw that data out. One plus is that you can also use the model value as your input for that single bad message. If we then get several bad messages in a row, we can decide that the source is unreliable and flag a fault. I have worked with systems where this model is implemented in analog hardware (think analog computer), or in software.
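A software sketch of that model-based check (the thresholds are made-up values for illustration):

```python
class PlausibilityFilter:
    MAX_STEP = 0.5    # assumed largest credible change per sample
    MAX_BAD = 3       # consecutive rejects before declaring a fault

    def __init__(self, initial: float):
        self.value = initial
        self.bad_in_a_row = 0

    def update(self, measured: float, model_delta: float) -> float:
        predicted = self.value + model_delta   # what the model expects
        if abs(measured - predicted) > self.MAX_STEP:
            self.bad_in_a_row += 1
            if self.bad_in_a_row >= self.MAX_BAD:
                raise RuntimeError("input source unreliable, flag a fault")
            self.value = predicted    # substitute the model value
        else:
            self.bad_in_a_row = 0
            self.value = measured
        return self.value

f = PlausibilityFilter(20.0)
assert f.update(20.1, 1 / 60) == 20.1        # plausible reading accepted
out = f.update(30.0, 1 / 60)                 # 10-degree jump rejected
assert abs(out - (20.1 + 1 / 60)) < 1e-9     # model value used instead
```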
« Last Edit: June 24, 2024, 09:13:05 pm by MarginallyStable »
 

Offline SteveThackery

  • Contributor
  • Posts: 20
  • Country: gb
As you've already heard, you cannot guarantee zero risk of bad instructions being received, but there's a lot you can do to drive the risk towards zero.

There is another layer you need. You need to make the receiving device smart - smart enough to perform sanity checks on the incoming instructions. This is essential - not only might a corrupted transmission cause a harmful instruction, so might a programming mistake - a bug - in the controller software.  No amount of error checking at the transport layer can protect against that.

It is the responsibility of the receiving device to be robust against instructions or data that could be harmful, not the controller. The receiving device is responsible for its own safety and the safety of any people or things in its operating radius.  This is a fundamental principle of a fail-safe design.  This responsibility cannot be placed in the control unit because a fault scenario might arise where the controller has crashed, or cannot establish a data link, or might even have been hacked.

Also, the controller might not be aware of all the possible moves and other functions the receiving device can perform, so there may be some "protective" instructions it doesn't know about.  In effect, the machine that receives the instructions presents an API to the controller, with all the valid control commands it can use.  But the receiving device may well have all sorts of machine-specific functionality that is not in the API, so the controller knows nothing about it.  It might be housekeeping stuff to do with tool loading, overload protection, etc.  That is why the sanity checking needs to be done by the receiving device: only that device has a complete knowledge of itself, so to speak.

As far as possible you should ensure the machine receiving the commands cannot do anything harmful to itself or its surroundings, regardless of what comes down the data link (including when nothing comes down the link).

This concept is the mechanical equivalent of object orientation, and specifically encapsulation, in software. It is a well established engineering principle.
« Last Edit: June 25, 2024, 07:16:22 pm by SteveThackery »
 

Online David Hess

  • Super Contributor
  • ***
  • Posts: 16857
  • Country: us
  • DavidH
What I have done in the past, besides some sort of hash, checksum, or CRC, is to send the message twice, with the second copy inverted, and maybe have the receiver repeat it back.  If I was using asynchronous serial, then the natural thing is to repeat each character back as it is sent.
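The double-send-with-inversion can be sketched like this (the framing details are hypothetical):

```python
def encode(msg: bytes) -> bytes:
    # send the message followed by its bitwise complement
    return msg + bytes(b ^ 0xFF for b in msg)

def decode(frame: bytes):
    half = len(frame) // 2
    msg, inverted = frame[:half], frame[half:]
    if bytes(b ^ 0xFF for b in msg) != inverted:
        return None      # the two copies disagree: reject the frame
    return msg

f = encode(b"\x01\xff")
assert decode(f) == b"\x01\xff"
assert decode(bytes([f[0] ^ 0x01]) + f[1:]) is None   # bit flip caught
```

Inverting the second copy means a stuck-at fault that forces the same bit level in every byte (like the OC-192 story above) corrupts the two copies differently, so it cannot pass unnoticed.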
 

Offline SteveThackery

  • Contributor
  • Posts: 20
  • Country: gb
What I have done in the past, besides some sort of hash, checksum, or CRC, is to send the message twice, with the second copy inverted, and maybe have the receiver repeat it back.  If I was using asynchronous serial, then the natural thing is to repeat each character back as it is sent.

There is a simpler algorithm that is even more bulletproof.

1/ The sender sends the message repeatedly until it receives the same message back.

2/ Every time it receives a message, the receiver echoes it back to the sender.

As you can see, it is very simple. It is called "compelled" signalling, and it is compelled in both directions. When the sender receives an echo of its outgoing message, it knows for certain that the receiver has received it. 

There is a disadvantage to this system: it potentially uses a lot of bandwidth on the data link. I say "potentially" because a clean transmission from sender to receiver, followed by a prompt echo by the receiver, followed by a clean transmission back to the sender, doesn't use much bandwidth. Only when something gets corrupted, or the receiver is slow to respond, does the data usage increase because of the repeats.  This doesn't matter if the data link is point-to-point and not shared. If the data link is actually two points on a TCP/IP network (for instance), there is a risk of creating too much network traffic. The solution is to slow down the repeat rate from the sender, giving the receiver more time to respond.
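The compelled loop in sketch form (the channel functions are stand-ins for whatever link you use; the lossy channel below just simulates corrupted echoes):

```python
def compelled_send(msg, channel_send, channel_recv, max_tries=50):
    # Repeat until our own message comes back intact from the receiver.
    for _ in range(max_tries):
        channel_send(msg)
        if channel_recv() == msg:
            return True
    return False          # give up: declare the link down

def make_lossy_echo(fail_times):
    # toy echo channel that corrupts the first `fail_times` echoes
    state = {"fails": fail_times, "msg": None}
    def send(m):
        state["msg"] = m
    def recv():
        if state["fails"] > 0:
            state["fails"] -= 1
            return b"\x00" + state["msg"][1:]   # simulated corruption
        return state["msg"]
    return send, recv

send, recv = make_lossy_echo(2)
assert compelled_send(b"go fwd", send, recv)   # succeeds on the 3rd try
```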
 

Offline max_torque

  • Super Contributor
  • ***
  • Posts: 1307
  • Country: gb
    • bitdynamics
If it's really safety critical, you don't use CAN, you use FlexRay!  (which is ISO 26262 capable)
 

Online Doctorandus_P

  • Super Contributor
  • ***
  • Posts: 3529
  • Country: nl
Even with a 16-bit CRC you already get quite high reliability. Bus communication tends to either work nearly perfectly or produce lots of errors, so a rise in faulty packets (which require resends, which can be counted) can be used to detect problems and sound an alarm.
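Counting those resends for an alarm might look like this (window size and threshold are arbitrary choices):

```python
class ErrorRateMonitor:
    WINDOW = 1000          # frames per measurement window
    ALARM_RATIO = 0.01     # assumed threshold: > 1 % bad frames

    def __init__(self):
        self.frames = 0
        self.errors = 0

    def record(self, crc_ok: bool) -> bool:
        # returns True when a completed window shows too many errors
        self.frames += 1
        self.errors += (not crc_ok)
        if self.frames >= self.WINDOW:
            alarm = self.errors / self.frames > self.ALARM_RATIO
            self.frames = self.errors = 0
            return alarm
        return False

m = ErrorRateMonitor()
alarms = [m.record(i % 50 != 0) for i in range(1000)]  # 2 % bad frames
assert alarms[-1] is True and not any(alarms[:-1])
```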

Also, do not underestimate the mechanical part. With "moving equipment" you need flexible cables with thin copper strands to reduce metal fatigue. Measuring wire impedance (termination resistors) can be part of verifying signal integrity. For a CAN bus, for example, a few resistors of 100k or thereabouts could pull the wires into a fault condition if a wire breaks, and this can give an alarm even for intermittent wire faults.

Fault detection can also be combined with redundant buses.

 

