Imagine you have two HDDs in RAID1 (mirroring), formatted with one of { ext3, ext4, xfs }. You copy a file, but one of the two disks in the array has several IO errors.
What will happen?
Well, { ext3, ext4, XFS, ... } have no checksums on data; only the metadata is checksummed. So if one of the two disks randomly fails (fail() = { data = io_read(); cache = data; write(data + random); }) during a copy, the result will be silently corrupted.
Except:
- smartmontools (smartd) will log the IO errors
- dmesg will output IO errors
- but "/bin/cp" will return without any error.
If you copied several files, e.g. with "cp -r", you don't know which file got corrupted during the copy!
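A minimal sketch of how you could catch this yourself (the paths /data/src and /mnt/raid1/dst are hypothetical examples):

```sh
# /data/src is the source tree, /mnt/raid1/dst the copy on the degraded array.
cp -r /data/src /mnt/raid1/dst        # exits 0 even if a disk corrupted the data

# Hash every file on both sides and compare; a diff line names a damaged file.
(cd /data/src      && find . -type f -exec md5sum {} + | sort -k2) > /tmp/src.md5
(cd /mnt/raid1/dst && find . -type f -exec md5sum {} + | sort -k2) > /tmp/dst.md5
diff /tmp/src.md5 /tmp/dst.md5
```

Beware: even this check can lie as long as the copy is still cached, which is exactly the next point.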
Worse still, as long as the data remains in the page cache, everything seems fine: sync only forces pending writes out to disk, it doesn't clear the kernel-side cache, so every read comes straight back from RAM.
Moral of the story: as soon as you unmount and remount the disk, you have a good chance of finding a corrupt file (the copy is corrupt, the original is not, the two md5sums do not match), both on the healthy disk and on the disk that showed IO errors, since RAID1 serves reads from either mirror.
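A sketch of how to force the re-read from disk without physically remounting (drop_caches is a real kernel knob; the file paths are again hypothetical):

```sh
sync                                  # flush pending writes to disk
echo 3 > /proc/sys/vm/drop_caches     # as root: drop the page cache (or umount/mount)
md5sum /data/src/bigfile /mnt/raid1/dst/bigfile   # now the reads hit the platters
```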
- - -
If you use filesystems with checksums on both data and metadata, like { ZFS, Btrfs, ... } with at least two pooled disks (mirroring), then the filesystem itself verifies every block it reads against its checksum. Noticing the corruption still isn't fully automatic (you have to read or scrub the data), but the tools to detect it and to restore the data from the good mirror are already implemented.
ZFS immediately tells you which file has been corrupted.
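For instance (the pool name "tank" is made up):

```sh
zpool scrub tank        # read and verify every block against its checksum
zpool status -v tank    # on corruption, prints "Permanent errors have been
                        # detected in the following files:" with full paths
```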
Btrfs ... tells you indirectly, through dmesg, so you need to log everything and filter the output.
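A sketch of the Btrfs workflow (the mount point /mnt/raid1 is hypothetical, and the exact kernel-log wording varies by kernel version):

```sh
btrfs scrub start -B /mnt/raid1   # -B: run in the foreground and report a summary
dmesg | grep -i btrfs             # checksum errors show up here, e.g. lines
                                  # with "checksum error at logical ... (path: ...)"
```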