Author Topic: Faster alternatives to CRC? (Read 30651 times)

Infraviolet · « **on:** January 05, 2024, 09:37:44 pm »

I'm trying to implement a CRC checksumming check on an AVR microcontroller.

I'm making a 16 bit checksum from up to 35 bytes of data.

I use a 16 item table of uint16_t values to a table driven CRC algorithm.

The algorithm calls:

uint16_t crc_of_byte(uint16_t crc, uint8_t data_byte){
uint8_t i;
i = (crc >> 12) ^ (data_byte >> 4);
crc = pgm_read_dword_near(table + (i & 0x0f)) ^ (crc <<4);
i = (crc >> 12) ^ (data_byte >> 0);
crc = pgm_read_dword_near(table + (i & 0x0f)) ^ (crc <<4);
return crc;
}

for each byte of the up to 35, with the result of the function being fed back in as crc when it run for the next byte of the array it is calculating the CRC of.

But it seems rather slow, quite a lot of clock cycles per byte to the extent that with a 16MHz clock rate my AVR chips are taking about 6.5us per byte to calculate checksums.

I cannot switch to a 256 element table as that would take too much PROGMEM space.

I'm aware of something called a sliced CRC table, which is apparently faster than normal CRC tables, mentioned at https://pycrc.org/pycrc.html . But I tried generating a CRC table by that method using the very useful python script provided, and it wouldn't do it unless I used a 256 element table anyway.

I am not tied to using a particular CRC polynomial, nor even to using a CRC at all, I could try anything which can detect errors and produce a 2 byte "checksum" for an array up to 35 bytes.

What else could I try for error detection? I'd like methods which would do a good job of catching burst errors, and "sliding" errors*, as well as having the largest easily feasible hamming distance (max number of bit flip errors somewhere throughout a message before you can potentially get a bit flip which makes a corrupted message which still passes the check as if it were valid).

Thank You

*By sliding, I mean, imagine the effect on data if a single bit is missed, and everything following shifts earlier by one bit, or if a phantom bit is added and everything afterwards shifts later by one bit.
121,82,96,3
(
01111001,
01010010,
01100000,
00000011
)
could "slide" to, if an error happened that missed the last bit of 121:
120,164,192,6
(
01111000,
10100100,
11000000,
00000110
)

ataradov · « **Reply #1 on:** January 05, 2024, 09:58:51 pm »

https://en.wikipedia.org/wiki/Fletcher%27s_checksum is pretty good for detecting bit errors in short messages. It has a few weaknesses, but those are usually not an issues for real applications.

brucehoult · « **Reply #2 on:** January 05, 2024, 10:15:16 pm »

Quote from: Infraviolet on January 05, 2024, 09:37:44 pm

I cannot switch to a 256 element table as that would take too much PROGMEM space.

That seems rather strange as that's only 512 bytes and even the ATTiny25/45/85 series is available with 8K of program space. It will be much faster!

You could switch to a simple xor checksum (multiple CRC-1 streams in parallel), but that won't meet your burst etc requirements.

I don't think you're going to find any algorithm with as good or better protection properties than CRC with lower memory or clock cycles requirements. There's a reason the CRC algorithm is still used after sixty years, though better polynomials are known today than were initially used (e.g. CRC-32C (Castagnoli)).

And of course a 32 bit CRC will give much better protection than a 16 bit CRC.

Berni · « **Reply #3 on:** January 05, 2024, 10:24:28 pm »

I also like using the fletcher checksum.

It is very fast on MCUs because it is just 2 sum operations. Sure CRC is more efficient in hardware, but that doesn't help unless the MCU has such hardware built in.

brucehoult · « **Reply #4 on:** January 05, 2024, 10:29:24 pm »

Quote from: ataradov on January 05, 2024, 09:58:51 pm

https://en.wikipedia.org/wiki/Fletcher%27s_checksum is pretty good for detecting bit errors in short messages. It has a few weaknesses, but those are usually not an issues for real applications.

Fletcher and Adler are certainly a lot cheaper to compute than CRC, for slightly worse but maybe adequate protection.

You could also look into the much more recent murmurhash, cityhash, farmhash series developed primarily for very fast hashing of relatively short keys for hash tables. There's no reason they can't also be used for data transmission. I guess one thing to watch out for is that they pretty much assume shifts by arbitrary amounts are fast, which is not the case on AVR, unlike on typical ARM/RISC-V implementations.

SiliconWizard · « **Reply #5 on:** January 05, 2024, 11:11:16 pm »

Good alternatives above. But don't forget the ultra-simple approaches if all else is still too expensive: the simple checksum, just one addition. It's not ultra-robust for sure, but has proven adequate in myriads of applications as long as a *very occasional* (in practice) false "ok" is not overly critical.

But I otherwise second the Fletcher checksum as a better alternative while still cheap. And of course as Berni mentioned, that's a lot more efficient in software but for direct hardware implementation, the story is different.

Infraviolet · « **Reply #6 on:** January 06, 2024, 12:59:03 am »

"That seems rather strange as that's only 512 bytes and even the ATTiny25/45/85 series is available with 8K of program space. It will be much faster!"

Thanks, I'll double check how much flash space is in use for everything else the program does, things are tight on an ATTiny85, but there might be 512 bytes offlash to spare. I might have been mentally equating those bytes with the SRAM, which is ofcourse much smaller than the flash in which a PROGMEM array can sit.

brucehoult · « **Reply #7 on:** January 06, 2024, 01:14:11 am »

Quote from: SiliconWizard on January 05, 2024, 11:11:16 pm

Good alternatives above. But don't forget the ultra-simple approaches if all else is still too expensive: the simple checksum, just one addition.

I mentioned XOR, which has essentially the same cost and properties as ADD. I wrote XOR/ADD at first, but simplified before posting.

An N-bit XOR checksum is exactly N 1-bit CRCs interleaved. And ADD checksum is similar but with harder to analyse properties due to the internal carries (it's not better!)

Quote

It's not ultra-robust for sure, but has proven adequate in myriads of applications as long as a *very occasional* (in practice) false "ok" is not overly critical.

Any N-bit checksum, no matter what algorithm (as long as all input bits contribute to the checksum!) will detect all but 1 in 2^N of all possible random transmission errors i.e. any 16 bit checksum will detect 65535/65536 of random errors. ADD/XOR does not differ from CRC in this.

The difference is in matching the kind of checksum to the expected non-random distribution of transmission errors.

In practice, with many transmission technologies, most errors occur in bursts, with most bits transmitted correctly (usually no errors at all) but with at most one contiguous burst of incorrect bits. N-bit CRCs are precisely the mathematical algorithm that guarantees that not only all bursts of N bits or fewer will be detected but also a very high proportion of bursts of a small multiple of N.

All other checksums are inferior in their protection against a single burst of errors. An XOR/ADD of size N, for example, is easily fooled by a burst of length N+1. Fletcher is actually worse, in that an N-bit Fletcher checksum (two N/2 bit checksums) can also be fooled by bursts of N/2+1 bits i.e. a 16 bit Fletcher checksum is only guaranteed to detect bursts up to 8 bits long.

If bursts are the likely problem, then a 16 bit XOR (two 8-bit XORs with interleaved streams) is much better than 16 bit Fletcher.

On the other hand, if data arriving out of order is a likely problem then XOR is completely useless. That's maybe a problem if you're checksumming an entire file, but shouldn't be if you're checksumming each indivisible data packet.

oPossum · « **Reply #8 on:** January 06, 2024, 01:30:35 am »

This is the method I use for small code size with good performance. It is essentially an optimized explicitly unrolled loop.

Code: [Select]

uint16_t crc;

void update_crc(uint8_t x)
{
        x ^= (crc >> 8);
        crc <<= 8;
        if(x & 0x01) crc ^= 0x1021;
        if(x & 0x02) crc ^= 0x2042;
        if(x & 0x04) crc ^= 0x4084;
        if(x & 0x08) crc ^= 0x8108;
        if(x & 0x10) crc ^= 0x1231;
        if(x & 0x20) crc ^= 0x2462;
        if(x & 0x40) crc ^= 0x48C4;
        if(x & 0x80) crc ^= 0x9188;
}

brucehoult · « **Reply #9 on:** January 06, 2024, 05:55:44 am »

Come to think of it, an N+1 bit error burst is 100% GUARANTEED to fool XOR checksums, because a burst (from the point of view of the receiver) is defined as the first and last bits both being flipped. If one of them wasn't flipped then it would only be an N bit (or less) error burst.

Berni · « **Reply #10 on:** January 06, 2024, 09:25:20 am »

To be fair no error detection method is 100% reliable.

As long as there is a considerable amount of data to be hashed down into a error detection word, you have a lot of combinations that can result in the same checksum output. So if the error detection word is 8 bit you always have a 1 in 256 chance that a completely random sequence of input bytes will appear to be correct.

It is more about picking a method that is most robust against the type of data corruption that might happen in the particular operation. A radio link might have a burst of complete garbage in the middle. A really long RS232 cable might suffer from an occasional flipped bit or lost byte. Some other transfer link might suffer from chunks of data getting swapped around into the wrong order. Some might guarantee the length and byte order to always be correct but commonly only 1 byte is garbage.

For some of these a dead simple 8bit sum might be enough. Heck for some even 1 bit parity might be enough. For other cases you might want 32bit of proper CRC, or for some cases you might want actual ECC codes so you can patch up errors in your data. But further you go the more processing power a MCU would have to dedicate to the task.

HwAoRrDk · « **Reply #11 on:** January 06, 2024, 09:53:04 am »

Have you tried the CRC16 implementation of avr-libc? It is a well optimised implementation written in assembly. Looks like it is branchless too, so should be pretty fast. Although probably still not as fast as a table-based algorithm.

Beware that it's inlined, so if you're calling it in multiple places in your code, you'll end up with multiple copies of it in the binary, which may not be helpful if you're tight on space.

m k · « **Reply #12 on:** January 06, 2024, 01:31:10 pm »

Quote from: Infraviolet on January 05, 2024, 09:37:44 pm

But it seems rather slow, quite a lot of clock cycles per byte to the extent that with a 16MHz clock rate my AVR chips are taking about 6.5us per byte to calculate checksums.

No idea where time is consumed but anyway.

i = SLR((crc_high_byte ^ data_byte), 4);
i = SLR((crc_high_byte ^ (data_byte << 4)), 4);

You can also add table pointer directly and avoid pgm_read call.

Siwastaja · « **Reply #13 on:** January 07, 2024, 08:05:54 am »

Quote from: m k on January 06, 2024, 01:31:10 pm

You can also add table pointer directly and avoid pgm_read call.

What do you mean? AVR is a weird architecture with separate address spaces for RAM and FLASH. You either access flash with these special instructions, or you copy the tables to RAM. Latter would be faster (and doesn't require special access code), but the amount of RAM in these small controllers is so little that I guess OP has to read the tables from FLASH.

brucehoult · « **Reply #14 on:** January 07, 2024, 08:34:13 am »

Quote from: Siwastaja on January 07, 2024, 08:05:54 am

What do you mean? AVR is a weird architecture with separate address spaces for RAM and FLASH.

Pretty much standard for 8 bit microcontrollers. AVR is probably the *least* weird of them. At least the flash is byte-oriented (not 12 or 14 or whatever bits wide, varying from model to model, like in PIC). And PIC doesn't give you instructions to read data from flash, but only computed GOTO into a table of load immediate instructions.

SiliconWizard · « **Reply #15 on:** January 07, 2024, 08:41:40 am »

Quote from: brucehoult on January 07, 2024, 08:34:13 am

And PIC doesn't give you instructions to read data from flash, but only computed GOTO into a table of load immediate instructions.

Yes, at least the PIC16F and similar, that was a real pain. I think the PIC18 made it better though. Still a separate instruction for reading from flash, but at least there was one.

David Hess · « **Reply #16 on:** January 07, 2024, 08:56:47 am »

Quote from: brucehoult on January 07, 2024, 08:34:13 am

And PIC doesn't give you instructions to read data from flash, but only computed GOTO into a table of load immediate instructions.

PIC can read from Flash using indirect addressing.

jfiresto · « **Reply #17 on:** January 07, 2024, 08:58:57 am »

Quote from: Infraviolet on January 05, 2024, 09:37:44 pm

But it seems rather slow, quite a lot of clock cycles per byte to the extent that with a 16MHz clock rate my AVR chips are taking about 6.5us per byte to calculate checksums....

It has been a few years, but I bet you could do better with a 16-bit CRC. I guided avr-gcc 3.4.6 to take less than 2µs per byte checking the flash of an ATMega88.

oPossum · « **Reply #18 on:** January 07, 2024, 02:16:33 pm »

Quote from: David Hess on January 07, 2024, 08:56:47 am

Quote from: brucehoult on January 07, 2024, 08:34:13 am
And PIC doesn't give you instructions to read data from flash, but only computed GOTO into a table of load immediate instructions.

PIC can read from Flash using indirect addressing.

Yea. About 10 or so years ago the 8 bit PIC architecture was updated to allow the index registers (FSR/INDF) to directly access flash memory and have linear access to the register file (RAM). This allow much cleaner and faster pointers in C.

DiTBho · « **Reply #19 on:** January 07, 2024, 05:03:43 pm »

Quote from: brucehoult on January 06, 2024, 01:14:11 am

I mentioned XOR

it's used by SONY in the { Playstation1, Playstation2 } for checksuming both the memory card and pad { joypad, mouse }

David Hess · « **Reply #20 on:** January 07, 2024, 05:07:21 pm »

I was thinking of combining some operation like add with rotate and it ends up the old BSD checksum does exactly that. It only requires two operations per byte or word and requires no extra storage or tables.

I wonder about the properties if exclusive-or replaced the add.

coppice · « **Reply #21 on:** January 07, 2024, 05:11:43 pm »

Remember that there is nothing magically wonderful about a CRC. They started out from people trying to protect, scramble or whiten serial bit streams in hardware, with as few gates as possible. There are many alternatives which are better when the data exists in parallel form, or a mix of serial and parallel, in the path being protected. There are certainly things which are a lot simpler in software.

m k · « **Reply #22 on:** January 08, 2024, 06:48:23 am »

Quote from: Siwastaja on January 07, 2024, 08:05:54 am

Quote from: m k on January 06, 2024, 01:31:10 pm
You can also add table pointer directly and avoid pgm_read call.

What do you mean?

Do "internally" what "external" thing is doing and avoid a call.

Siwastaja · « **Reply #23 on:** January 08, 2024, 07:39:49 am »

Quote from: m k on January 08, 2024, 06:48:23 am

Quote from: Siwastaja on January 07, 2024, 08:05:54 am
Quote from: m k on January 06, 2024, 01:31:10 pm
You can also add table pointer directly and avoid pgm_read call.

What do you mean?

Do "internally" what "external" thing is doing and avoid a call.

https://manpages.debian.org/jessie/avr-libc/pgm_read_dword_near.3avr.en.html
pgm_read_dword_near is not a function, there is no call. It makes the compiler emit the correct instructions to access the flash address space, inlined where it's "called". There is no advantage of doing the same in inline assembly (which you apparently suggest?)

Jeroen3 · « **Reply #24 on:** January 08, 2024, 08:06:32 am »

It's always a balance between code size and speed.

The people squeezing an 8bit avr to the limit still managed to fit an 512 byte table.
https://github.com/prusa3d/Prusa-Firmware/blob/d6250592b20f2475bbd3e8d967e5026fda016489/Firmware/Sd2Card.cpp#L478

They also have a table-less variant of CRC8.
https://github.com/prusa3d/Prusa-Firmware/blob/MK3/Firmware/mmu2_crc.h


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Faster alternatives to CRC? (Read 30651 times)

Share me