Data Transfer integrity

This question might be better suited for a computer science forum than a hardware one, sorry.

My machine has 2 disks, the main one for the OS (V13.1 64-bit) plus my data, and the second one just for a copy of my data (my backup).

I usually make this copy using rsync but sometimes with a file copy and paste.

During a fresh install of V13.2, the main hard disk starts clicking, the DVD installation takes over an hour, and I realize there’s a problem.

I add a new main disk and the fresh install goes smoothly, less than 15 minutes, no problems at all (good job, guys, thank you).

I am able to copy my data from both disks, but I prefer to use the data from the old main disk because the second disk copy is a few weeks old.

Here are my questions.

Q1) How dependable is a file copy, how dependable is rsync?

From my DOS days, I used the /verify option whenever copying, and bad transfers were detected…

Q2) How do you prove that a terabyte sized directory copy is identical?

Q3) Is there a duplication detection program?

Thank you.

rsync with the -c parameter generates checksums for the files and is very reliable.
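
For example, something along these lines should do it (the paths are only placeholders):

rsync -avc /data/ /mnt/backup/data/

Here -a preserves permissions and timestamps, -v is verbose, and -c makes rsync compare files by checksum rather than just by size and modification time.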

On 2014-12-07 22:56, rih5342 wrote:

> Here are my questions.
>
> Q1) How dependable is a file copy, how dependable is rsync?
>
> From my DOS days, I used the /verify option whenever copying, and bad
> transfers were detected…

I remember.

rsync is very reliable, because it does checksums. I’m unsure if they
are always done, or whether you have to specify them with the “--checksum” option.

I believe that the first time it does a checksum on everything it transfers. A second run only
checks timestamps and size, unless you set that option, in which case it
does the full check on all files.

man says:

-c, --checksum skip based on checksum, not mod-time & size

-c, --checksum
This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a “quick check” that (by default) checks if each file’s size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit checksum for each file that has a matching size. Generating the checksums means that both sides will expend a lot of disk I/O reading all the data in the files in the transfer (and this is prior to any reading that will be done to transfer changed files), so this can slow things down significantly.

The sending side generates its checksums while it is doing the file-system scan that builds the list of the available files. The receiver generates its checksums when it is scanning for changed files, and will checksum any file that has the same size as the corresponding sender’s file: files with either a changed size or a changed checksum are selected for transfer.

Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option’s before-the-transfer “Does this file need to be updated?” check.

For protocol 30 and beyond (first supported in 3.0.0), the checksum used is MD5. For older protocols, the checksum used is MD4.
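
If you just want to verify an existing copy with rsync itself, a dry run along these lines should work (paths are only illustrative):

rsync -avcn --itemize-changes /data/ /mnt/backup/data/

With -n (--dry-run) nothing is actually copied; any file that the -c comparison finds differing is simply listed.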

You may be interested to know that some filesystems are implementing or
about to implement checksumming, storing the checksums amongst the “classical”
metadata. btrfs has it, I think, and xfs may have it. They certainly
were talking about doing it.

> Q2) How do you prove that a terabyte sized directory copy is identical?

My way is to create, in each directory, an md5sum list of every file,
which can be done with a single command:


md5sum -b * > checksums

You copy the “checksums” file to the other directory, then run:


md5sum -c --quiet checksums
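
Note that “md5sum -b *” only covers the files directly in that directory. For a whole tree with subdirectories, something along these lines should work instead (a sketch; /tmp/checksums is just an example path, kept outside the tree so the list itself is not checksummed):

find . -type f -exec md5sum -b {} + > /tmp/checksums

Run that from the top of the original, then verify from the top of the copy with “md5sum -c --quiet /tmp/checksums”.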

Another possibility is using “mc” (Midnight Commander), which has a directory compare function.

> Q3) Is there a duplication detection program?

Yes, fdupes. Quite spartan.
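
For example, to scan a tree recursively (the path is just a placeholder):

fdupes -r /home/user/data

It prints the sets of files that have identical content, one set per blank-line-separated block.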


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)