Bit rot is real

On 2014-01-30 20:55, Larry Finger wrote:

> @ZStefan: I know your mind is probably made up and we should not bother
> you with facts, but we may still influence some other readers of this
> thread. Before you contribute to FUD, please read on how ECC works.

Well, the btrfs devs certainly added this data recovery mechanism. They
would not have bothered if the hardware could always cope :-)
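For the curious, that mechanism can be exercised on demand; a minimal
sketch, assuming a btrfs filesystem mounted at /mnt/data:

sudo btrfs scrub start /mnt/data    # read everything back and verify the stored checksums
sudo btrfs scrub status /mnt/data   # report any checksum errors (and repairs, if the data is redundant)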


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

My mind, even though so small (the smaller the brain, the happier the creature is), is not made up. I am trying to find a solution.

I wouldn’t even bother with high-level file verification tools had I not encountered the following cases over about 8 years (the ones I remember). In all cases, I checked or tried to check the disk for errors, and couldn’t find any with reasonable effort. Most likely, the data corruption occurred during the years of storage.

Once I had to discard a very large zip file. It became unzippable when it was needed months later. I usually check backups, through ‘diff -r’, and probably have checked the file right after creation. Data was lost forever, since the original was deleted to use the disk space for other purposes.

I had to discard, because of data corruption, two large data files, containing acquired data. As the data was acquired, the files were reviewed and tested and nothing wrong was noticed. This was painful, I had to explain to colleagues. I have kept one of the corrupted files.

I have a relatively short, old video file, in which the distortion described in the original article occurs. When it was played after acquisition years ago, I don’t remember seeing the distortion.

I had to edit a few audio files which contained the violent chirp described in the article. The durations of chirps were short and the edits were successful. I don’t remember when the chirps appeared but I have the impression that they came in gradually.

While there can be other explanations, bit rot is on my mind, considering the way the corruption appeared.

I agree with this. A simple automatic refresh may be disastrous. A more complex refresh, with comparison and late deletion of the original, might be better though.

But I am not thinking about refreshing the file. I worry about corruption gone undetected. Namely, I would like to use an automatic tool to compare the checksums of the source file and of the backup file, to alarm me if they are different. Otherwise, if the source is corrupted, then by the next backup, which will likely take place on the same HD, the data will become irreparably corrupted.

I read about par2 and tried to install it.

The goal is exactly what I want, but the usability is not there. The implementation has problems with large files. The last substantial update was in 2006.

However, I downloaded and tried to compile. Of course it didn’t work: there were multiple warnings, and an error stopped the compilation, even though ./configure was successful. In short, not usable.

Someone is working on PAR3. Their forum shows some activity in 2011-2012.

However, every effort in the Parchive project is abandoned, buggy, experimental or incomplete.

It looks like I have to create the checksums manually…
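If it comes to that, a minimal manual sketch with standard tools (GNU
coreutils assumed; /storage and the manifest name are placeholders):

cd /storage
find . -type f -exec sha256sum {} + > ~/storage.sha256   # create the checksum list
sha256sum --check --quiet ~/storage.sha256               # later, from /storage: print only files that FAIL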

ZStefan wrote:
> But I am not thinking about refreshing the file. I worry about
> corruption gone undetected. Namely, I would like to use an automatic tool
> to compare the checksums of the source file and of the backup file, to
> alarm me if they are different.

Rsync will happily do that for you.

On 2014-01-31 13:21, Dave Howorth wrote:
> ZStefan wrote:
>> But I am not thinking about refreshing the file. I worry about
>> corruption gone undetected. Namely, I would like to use an automatic tool
>> to compare the checksums of the source file and of the backup file, to
>> alarm me if they are different.
>
> Rsync will happily do that for you.

Mmm, not really.

It compares the source with the backup, yes. If the source was damaged,
as described, the backup will be faithful to the source; the checksum
would match. But the backup would have the damaged file.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

Carlos E. R. wrote:
> On 2014-01-31 13:21, Dave Howorth wrote:
>> ZStefan wrote:
>>> But I am not thinking about refreshing the file. I worry about
>>> corruption gone undetected. Namely, I would like to use an automatic tool
>>> to compare the checksums of the source file and of the backup file, to
>>> alarm me if they are different.
>> Rsync will happily do that for you.
>
> Mmm, not really.
>
> It compares the source with the backup, yes. If the source was damaged,
> as described, the backup will be faithful to the source; the checksum
> would match. But the backup would have the damaged file.

Duh, no. You check the checksums before you make the backup!

Now how you know whether the checksums are different because of ‘bit
rot’ or because you changed the file and really do want to make a new
backup, that is an exercise for the reader.
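A sketch of one heuristic for that exercise: a file whose checksum
changed while its modification time did not is a prime bit-rot suspect.
This assumes a previously saved manifest of “checksum mtime path” lines
(a hypothetical format, built with sha256sum and GNU stat):

while read -r sum mtime path; do
  [ "$(sha256sum "$path" | awk '{print $1}')" = "$sum" ] && continue
  if [ "$(stat -c %Y "$path")" = "$mtime" ]; then
    echo "POSSIBLE BIT ROT: $path"       # content changed, mtime did not
  else
    echo "modified, new backup due: $path"
  fi
done < manifest.txt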

On 2014-01-31 14:41, Dave Howorth wrote:

>>> Rsync will happily do that for you.
>>
>> Mmm, not really.
>>
>> It compares the source with the backup, yes. If the source was damaged,
>> as described, the backup will be faithful to the source; the checksum
>> would match. But the backup would have the damaged file.
>
> Duh, no. You check the checksums before you make the backup!

Of course we do.

That’s not the issue. The problem described here is when the existing
copy of the file gets damaged while stored.

· You have the original, assumed correct.
· You make a backup, using rsync with checksum verification.
· The backup will be correct.
· You store it.
· The original is erased.
· At some time, some small corruption happens on the backup (or rather,
archive), and the hardware does not detect it. The chance is very small,
but as the stored data is huge, the possibility becomes real. Small, but
real.
· The backup is now bad.
· Any other backup will be correct compared to the previous backup, but
not to the original.

Notice that rsync with checksums does not store the checksums. It simply
calculates the checksums of both original and backup; if they differ,
the file is copied again. But those checksums are not stored, so it is
not possible to verify the integrity of the backup if you do not have
the original files.
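To illustrate, something like this compares full checksums on both sides,
but only to decide what to re-copy; nothing is kept for a later audit,
and everything is recomputed on each run (paths are placeholders):

rsync -a --checksum --itemize-changes --dry-run /storage/ /remote_mnt/backup_storage/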


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

Carlos E. R. wrote:
> On 2014-01-31 14:41, Dave Howorth wrote:
>
>>>> Rsync will happily do that for you.
>>> Mmm, not really.
>>>
>>> It compares the source with the backup, yes. If the source was damaged,
>>> as described, the backup will be faithful to the source; the checksum
>>> would match. But the backup would have the damaged file.
>> Duh, no. You check the checksums before you make the backup!
>
> Of course we do.
>
> That’s not the issue. The problem described here is when the existing
> copy of the file gets damaged while stored.

No. The problem described was:

“I would like to use an automatic tool to compare the checksums of the
source file and of the backup file, to alarm me if they are different.”

On 2014-01-31 07:16, ZStefan wrote:

>> par2.

> I read about par2 and tried to install it.
>
> The goal is exactly what I want, but the usability is not there. The
> implementation has problems with large files. The last substantial
> update was in 2006.

I have used it with DVDs, holding files under 500MB each.

> However, I downloaded and tried to compile. Of course it didn’t work:
> there were multiple warnings, and an error stopped the compilation, even
> though ./configure was successful. In short, not usable.

No need to build; it is available on the distro, from some home repos
(version 0.4).
Version 0.4

I built it myself years ago, version 0.4. I would have to find out if it
still builds.

There is also “kde3-kpar2” on the kde3 repo.

I see there is a library with this description:

+++··········
libpar2-0 - Library for Performing Common Tasks Related to PAR Archives

LibPar2 allows for the generation, modification, verification, and
repairation of PAR v1.0 and PAR v2.0 (PAR2) recovery sets. It contains
the basic functions needed for working with these sets and is the basis
for GUI applications such as KPar2 and GPar2.
··········+++

So the next step is finding that Gpar2 (KPar2 is the kde3-kpar2 above).
I don’t see it on the distro repos. […] Hum. It is on parchive from
sourceforge.

I also see “par2cmdline” on some home repos, not the same repos as for
par2, even though they could be the same program.

> Someone is working on PAR3. Their forum shows some activity in
> 2011-2012.

Oh.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

On 2014-01-31 15:03, Dave Howorth wrote:
> Carlos E. R. wrote:

> No. The problem described was:
>
> “I would like to use an automatic tool to compare the checksums of the
> source file and of the backup file, to alarm me if they are different.”

Well, the paragraph then is confusing, because what he really means is
what I explained :-) Surely he will confirm.

I’ll try to rewrite that paragraph.

Create checksums of the files on the source and/or the backups, so that if
later (after the backup is made and verified as correct) a file on either
side changes, the change can be detected using the checksums stored on that
same side. The goal is to alarm the user if a file changes at any time
(months, years) after the backup was made and verified.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

Thanks for the help and discussions, and sorry for the confusion. The discussion shows me that there is more to verified storage than I thought earlier.

Let me reformulate my wish by combining several ideas.

To create a tool that will work automatically and traverse the directories recursively, doing the following:

  1. Create checksums of all files satisfying certain conditions (e.g., size > 1 MB). Write the checksums to a simple text file.

  2. After the checksum files are created, recalculate the checksums periodically (automatically or on demand). Do not overwrite the existing checksum files.

  3. If a calculated checksum differs from the stored one, then alarm the user.

  4. If there are one or more backups, then do the same in each of the backup storage areas, writing the checksums to a writable place (the same folder in most cases), once.

  5. If there are backups, then also compare the calculated checksums across the different storage areas. Alarm the user if a checksum does not match.

The usage of this software might look like this. We assume that all storage areas are writable.

Running it for the first time to create checksums in two directories, one of which is an NFS-mounted directory:

checksums -r --create --filter="*.dat & size>1MB" --save-in-same-dir /storage /remote_mnt/backup_storage 

Checking for corruption. Computes files’ checksums:

checksums -r --verify --source-in-same-dir    /storage   
checksums -r --verify --source-in-same-dir    /remote_mnt/backup_storage 

Comparing checksums in different storages. This is a somewhat redundant operation, and possibly a script based on the diff command can do the same. Files’ checksums are not computed:

checksums -r --compare --source-in-same-dir /storage /remote_mnt/backup_storage 

That’s all! I will be assured that the files have not become corrupted. If there is a checksum mismatch, I will look at the files instead of making the next backup.
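Until such a tool exists, steps 1 to 3 and 5 can be approximated with standard commands. A rough sketch (GNU coreutils assumed; relative paths keep the manifests comparable across the two storages):

cd /storage
find . -type f -size +1M -exec sha256sum {} + | sort -k2 > CHECKSUMS.sha256   # step 1: create once
sha256sum --check --quiet CHECKSUMS.sha256                                    # steps 2-3: later, prints only failures

cd /remote_mnt/backup_storage
find . -type f -size +1M -exec sha256sum {} + | sort -k2 > CHECKSUMS.sha256
sha256sum --check --quiet CHECKSUMS.sha256

diff /storage/CHECKSUMS.sha256 /remote_mnt/backup_storage/CHECKSUMS.sha256    # step 5: compare the storages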

I have installed kde3-kpar2. The program does nothing. Literally.

It cannot compute checksums, cannot work with directories and does not traverse directories. No help, no manual. It is an empty skeleton.

On 2014-02-01 10:46, ZStefan wrote:
>
> I have installed kde3-kpar2. The program does nothing. Literally.
>
> It cannot compute checksums, cannot work with directories and does not
> traverse directories. No help, no manual. It is an empty skeleton.

Wow :-(


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

There are security tools that do this. They are intended to find files that might have been infected by a virus. But they should also work for your purposes.

I’m just suggesting that you broaden where you are looking.
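One example of that category is AIDE (the Advanced Intrusion Detection
Environment). A minimal sketch of its workflow; the database paths vary
by distro and are set in aide.conf:

sudo aide --init                                          # build the baseline checksum database
sudo mv /var/lib/aide/aide.db.new /var/lib/aide/aide.db   # put the new database in place
sudo aide --check                                         # later: report every file whose checksum changed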

I have found a draft report on data corruption of various types, from 2007.

www.streamscale.com/media/pdf_folder/CERN_Data_Corruption.pdf

They have monitored many files on many computers.

The article is not entirely understandable for outsiders like me. Sometimes, in my opinion, they don’t clearly state whether they are speaking about correctable or uncorrectable errors. Overall, they confirm that the bit errors caused by disks are in the range of one bit per 10^14 bits read. But there are several other sources of errors. Their conclusion is that it is necessary to continuously monitor the saved data, though doing so doubles the load on the hardware.
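A rough worked example (mine, not the report’s): reading a 2 TB disk end to end means reading about 1.6 × 10^13 bits, so at one error per 10^14 bits one expects roughly 0.16 errors per full read, i.e. about one bad bit for every six complete reads of the disk.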

The study took place at approximately the same time as Google’s famous disk-failure study:
http://research.google.com/archive/disk_failures.pdf

Below are a few excerpts.

“This program writes a ~2 GB file containing special bit patterns
and afterwards reads the file back and compares the patterns.
This program was deployed on more than 3000 nodes (disk server,
CPU server , data bases server, etc.) and run every 2h. About 5
weeks of running on 3000 nodes revealed 500 errors on 100 nodes.”

“We have 44 reported memory errors (41 ECC and 3 double bit)
on ~1300 nodes during a period of about 3 month.”

" files on a disk pool were checked and the previously calculated
checksum on tape was compared with another adler32
calculation. During a test 33700 files were checked (~8.7 TB)
and 22 mismatches found."

“There is a clear distinction between ‘expected’ errors based on
the vendors reliability figures and obvious bugs/problems in the
hardware and software parts.”

" different protection mechanisms don’t work 100% , i.e. not every
errors in the data flow is correctly treated and reported to the
upper layers."

“We have established that low level data corruptions exist and
that they have several origins.”

On 2014-02-28 16:36, ZStefan wrote:
>
> I have found a draft report on data corruption of various types, from
> 2007.
>
> www.streamscale.com/media/pdf_folder/CERN_Data_Corruption.pdf

Very interesting.

I’d extract two paragraphs:

«Remark : This number is only true for non-compressed files. A test with
10000 compressed files showed that with a likelihood of 99.8 % a SINGLE
bit error makes the whole file unreadable, thus the data loss rate would
be much higher for compressed files»

which is why I do not currently use compressed backups.
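As an aside, detecting (though not repairing) such damage in a compressed
archive is cheap; a sketch with a placeholder file name:

gzip --test backup.tar.gz   # verifies the stream’s CRC; a single flipped bit is reported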

«4. Writing
All data are encoded with an ECC algorithm which is capable to correct
multiple 64 Kbyte blocks of data.
→ must be a special algorithm as normal ones are coping with 10s of byte
corrections only. At least 20% more data data needs to be written. More
CPU resources needed to do the encoding.»

Which I think I mentioned before. We need data storage methods with
error correction, not simply error detection.
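par2, discussed earlier in the thread, is exactly that kind of method at
the file level. A sketch with par2cmdline (file name and redundancy
percentage are arbitrary):

par2 create -r10 recovery.par2 big_file.dat   # write ~10% recovery data alongside the file
par2 verify recovery.par2                     # later: detect corruption
par2 repair recovery.par2                     # reconstruct damaged blocks from the recovery data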


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

Seagate recently introduced 6 TB hard drives which have six platters.

They have a measure of bit error rate that I see for the first time:

“Nonrecoverable Read Errors per Bits Read: 1 sector per 10E15”
(1 sector is 512 bytes).

Used to be 1 bit per 10E15 for enterprise HDs and 1 bit per 10E14 for consumer HDs.

Now, does switching from 1 bit to 1 sector mean an increase in the error rate by a factor of roughly 4000 (a 512-byte sector is 4096 bits)? I think this is some redefinition, which makes one question the earlier well-known rate of “1 bit per 10E14”.

Probably that “1 bit per 10E… bits read” is a meaningless, useless or suspicious number. Maybe it is not useful even for rough estimates: its definition changes, and it is difficult to believe that the error rate differs tenfold between enterprise and consumer HDs, given the studies showing that those two classes behave similarly in terms of reliability.

And there is nothing said about write error rates. Perhaps because they cannot be measured separately?

What? Every large corporation uses terabytes or more of hard drives in various RAID configurations for extremely extended amounts of time without issue. We have systems at my work that are over a decade old, running RAID without any random file corruption issues.

If you are running RAID with daily or weekly scheduled volume checks and ECC RAM, your data will not get corrupted no matter how long the system runs. The equipment will go obsolete or the drives will wear out before your files miraculously become unreadable due to “bit rot”.
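For reference, on Linux software RAID that scheduled volume check is a
scrub; a sketch, assuming an md array at /dev/md0 (run as root):

echo check > /sys/block/md0/md/sync_action   # start a verification pass over the whole array
cat /sys/block/md0/md/mismatch_cnt           # afterwards: count of mismatches found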

On 2014-12-16 22:26, bjd223 wrote:
> runs. The equipment will go obsolete or the drives will wear out before
> your files miraculously become unreadable due to “bit rot”.

Did you read the original report?


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

This reminds me of an online forum. It used RAID and had everything under control.

One day, a couple of years ago, there was a bad power outage at the data center. When the power came back on, the RAID controller did not come up. They were unable to find an identical RAID controller.

They eventually recovered the data, but it took a month.