Imagine you have two HDDs in RAID1 (mirroring), formatted with one of { ext3, ext4, xfs }. You copy a file, but one of the two disks in the array has several IO errors.
What will happen?
Well, { ext3, ext4, XFS, ... } have no checksums on data; only the metadata is checksummed. So if one of the two disks randomly fails (fail() = { data = io_read(); cache = data; write(data + random); }) during a copy, the result will be silently corrupted.
Except:
- smartmontools (smartd) will log the IO errors
- dmesg will output IO errors
- but "/bin/cp" will return without any error.
If you copied several files, e.g. with "cp -r", you don't know which file got corrupted during the copy!
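A minimal sketch of how you could catch this yourself (the paths /data/src and /mnt/raid1/dst are hypothetical examples):

```sh
# /data/src is the source tree, /mnt/raid1/dst the copy on the degraded array.
cp -r /data/src /mnt/raid1/dst        # exits 0 even if a disk corrupted the data

# Hash every file on both sides and compare; a diff line names a damaged file.
(cd /data/src      && find . -type f -exec md5sum {} + | sort -k2) > /tmp/src.md5
(cd /mnt/raid1/dst && find . -type f -exec md5sum {} + | sort -k2) > /tmp/dst.md5
diff /tmp/src.md5 /tmp/dst.md5
```

Beware: even this check can lie as long as the copy is still cached, which is exactly the next point.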
Worse still, as long as the data remains in the page cache, everything seems fine: sync only forces pending writes out to disk, it doesn't clear the kernel-side cache, so every read comes straight back from RAM.
Moral of the story: as soon as you unmount and remount the disk, you have a good chance of finding a corrupt file (the copy is corrupt, the original is not, the two md5sums do not match), both on the healthy disk and on the disk that showed IO errors, since RAID1 serves reads from either mirror.
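A sketch of how to force the re-read from disk without physically remounting (drop_caches is a real kernel knob; the file paths are again hypothetical):

```sh
sync                                  # flush pending writes to disk
echo 3 > /proc/sys/vm/drop_caches     # as root: drop the page cache (or umount/mount)
md5sum /data/src/bigfile /mnt/raid1/dst/bigfile   # now the reads hit the platters
```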
- - -
If you use filesystems with checksums on both data and metadata, like { ZFS, Btrfs, ... } with at least two pooled disks (mirroring), then the filesystem itself verifies every block it reads against its checksum. Noticing the corruption still isn't fully automatic (you have to read or scrub the data), but the tools to detect it and to restore the data from the good mirror are already implemented.
ZFS immediately tells you which file has been corrupted.
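For instance (the pool name "tank" is made up):

```sh
zpool scrub tank        # read and verify every block against its checksum
zpool status -v tank    # on corruption, prints "Permanent errors have been
                        # detected in the following files:" with full paths
```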
Btrfs ... tells you indirectly, through dmesg, so you need to log everything and filter the output.
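A sketch of the Btrfs workflow (the mount point /mnt/raid1 is hypothetical, and the exact kernel-log wording varies by kernel version):

```sh
btrfs scrub start -B /mnt/raid1   # -B: run in the foreground and report a summary
dmesg | grep -i btrfs             # checksum errors show up here, e.g. lines
                                  # with "checksum error at logical ... (path: ...)"
```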