For detecting file corruption (integrity checking with no error correction), I recommend b2sum (part of GNU coreutils, so if you have sha256sum, you have b2sum too), which computes a 512-bit checksum using the BLAKE2b hash function. On typical hardware it is faster than md5sum or sha1sum, and roughly three times as fast as sha224sum, sha256sum, and friends.
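For a single file, a minimal example (the file name here is just a placeholder):

    # Record a 512-bit BLAKE2b checksum in binary (-b) mode:
    b2sum -b archive.tar > archive.tar.b2
    # Later, verify the file against the recorded checksum:
    b2sum -c archive.tar.b2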
For filesystem integrity checks and change detection, running "ionice -c 3 nice -n 20 find root -type f -print0 | ionice -c 3 nice -n 20 xargs -0 b2sum -bz" generates a list of paths and checksums that "ionice -c 3 nice -n 20 b2sum -cz file" can verify. (b2sum does not take file names on standard input, so xargs -0 is needed to turn find's NUL-separated output into arguments.) Thanks to the ionice -c 3 and nice -n 20 prefixes, both commands run at idle I/O and CPU priority, so they should not noticeably slow down anything else running at the same time. Note that the list uses NUL ('\0') instead of newline as the line terminator, to support all possible file names, so you need e.g. tr '\0' '\n' < file | less to view it.
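Put together, a sketch of the whole cycle (the directory root and the list file name root.b2sums are placeholders):

    # Generate the checksum list; each NUL-terminated line is "HASH *PATH".
    ionice -c 3 nice -n 20 find root -type f -print0 \
        | ionice -c 3 nice -n 20 xargs -0 b2sum -bz > root.b2sums

    # Later, verify from the same working directory:
    ionice -c 3 nice -n 20 b2sum -cz root.b2sums

    # Human-readable view of the NUL-terminated list:
    tr '\0' '\n' < root.b2sums | less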
For error correction, as mentioned by golden_labels, Parchive (the PAR2 format) is the closest commonly available existing tool that comes to mind.
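For illustration, a typical session with the par2 command-line tool (par2cmdline) might look like this; the 10% redundancy and the file names are just example choices:

    # Create recovery files with about 10% redundancy for a single archive:
    par2 create -r10 important.tar.par2 important.tar
    # Check the archive against the recovery data:
    par2 verify important.tar.par2
    # Attempt a repair if verification reports damage:
    par2 repair important.tar.par2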
I've personally found that fixing corruption with such a scheme is not worthwhile compared to storing multiple copies on physically separate media. I do use checksums to detect file corruption, but instead of trying to repair the damage I distribute copies physically, in the hope that at least one of them stays intact. I've found correctable errors (only a tiny part of a file corrupted) rather rare compared to losing the entire medium, especially when using Flash (cards, USB sticks) for backups.
There are many filesystems with built-in error detection and correction (Btrfs and ZFS, for example), so one option is to use a filesystem image (via a loopback device) to store the data. Essentially, you use losetup to set up a block device backed by your filesystem image. Initially, you format it with the desired filesystem; then you mount the loop device, at which point the data becomes accessible. Afterwards, you unmount the filesystem, fsync to ensure the image has reached the storage medium, and detach the loop device. Of course, if you dedicate media like Flash or spinning-rust drives to this, it makes more sense to format the device with that filesystem in the first place.
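As a rough sketch of that workflow, assuming Btrfs as the checksumming filesystem and treating the image path, size, and mount point as placeholders (most of these steps need root):

    # Create a sparse 4 GiB image file and attach it to a free loop device.
    truncate -s 4G backup.img
    LOOPDEV=$(losetup --find --show backup.img)

    # Format once with a checksumming filesystem; the "dup" profiles keep two
    # copies of data and metadata inside the image, so localized damage to one
    # copy can be repaired from the other.
    mkfs.btrfs -d dup -m dup "$LOOPDEV"

    # Mount, copy the data in, then unmount.
    mkdir -p /mnt/backup
    mount "$LOOPDEV" /mnt/backup
    cp -a /path/to/data /mnt/backup/
    umount /mnt/backup

    # Make sure the image file itself has reached stable storage, then detach.
    sync backup.img
    losetup -d "$LOOPDEV"

Later, after re-attaching and mounting the image, "btrfs scrub start /mnt/backup" verifies all checksums and repairs from the duplicate copies.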
If there were an actual use case for correcting individual bit flips (as opposed to losing entire sectors), one could use e.g. libcorrect to write one utility that optionally compresses the input file and then generates an output file with Reed-Solomon codes, and another that decodes such files and optionally decompresses them. libcorrect is often used for Software Defined Radio (SDR), and despite the recent supply-chain attack on xz-utils, I would use xz as the optional pre-compressor/post-decompressor.