Zip or tar causing loss?

I have encountered this twice, and want to know if you also have experienced data loss because of zip+tar.

I created a huge tgz file with
tar -f my.tgz -v -c -z …

The purpose was backup, and the file was around 50 GB.

After a few months, I could not restore the content. There was an error message, probably from the zip library. After attempting some rescue measures, I concluded that the file could not be recovered.

I decided to check the hard drive on which the file resided. badblocks in non-destructive and destructive modes reported no errors, and there was no reason to believe that the hardware was at fault.

Now I do not use zip for large files.

It is surprising that such old and well-tested software can (perhaps) fail.

The -z option to tar tells it to run the archive through gzip, not zip.
gzip and zip are two totally different things.
I suspect that you messed something up, possibly by trying to unzip the .tgz file when you should have simply extracted it with ‘tar -zxf file.tgz’.
I’ve never seen GNU tar screw up, so it’s very unlikely that it’s at fault here.

Why archive it in the first place? It just adds another complication to the mix, IMO.

I used gzip, evidently. I didn’t know that the -z option invokes gzip, which is different from zip.

I did the tar+gzip step correctly, as usual. I checked the size of the tgz as it was created, and it was as expected.

Both errors occurred while dealing with huge files.

One of the hard drives on which this occurred is still in use and does not cause any problems. I also checked the RAM with memtest.

Now I do not trust the tar+gzip process any more.

The tar+gzip process is very convenient for saving space and archiving folders; that’s why I used it. By “Why archive?” you probably mean that I could have used

zip -r …

I shall try it for huge files; haven’t used that a lot.

ZStefan wrote:
> I have encountered this twice, and want to know if you also have
> experienced data loss because of zip+tar.
>
> I created a huge tgz file with
> tar -f my.tgz -v -c -z …
>
> The purpose was backup, and the file was around 50 GB.
>
> After a few months, I could not restore the content. There was an error
> message probably from zip library. After attempting some rescue
> measures, I understood that the file could not be recovered.
>
> I decided to check the hard drive on which the file resided. badblocks
> in non-destructive and destructive modes reported no errors, and there
> was no reason to believe that hardware is at fault.
>
> Now I do not use zip for large files.
>
> It is surprising that such an old and tested software can fail
> (perhaps).

Like noident, I think it is unlikely tar messed up. It’s just as likely
that your filesystem or memory or disk messed up, and likeliest of all
is that the human messed up.

But we can’t help you decide what went wrong, because you haven’t told
us anything about it! What system are you using, what version of tar?
What filesystem? What hardware are you using? What exactly was the error
message? You say “there was no reason to believe that the hardware was at fault”, but you don’t tell us what investigations you conducted to reach that conclusion, other than badblocks - and I hope that wasn’t all!

I regularly use tar and gzip for big files and don’t have problems.

> Now I do not trust the tar+gzip process any more.

That you’re using tar tells me you are not new to computers, so you are probably familiar with the random problems that can creep into things. ‘tar’ and ‘gzip’ are both designed to work on streams of data, which has at least one huge advantage: size doesn’t matter (at one point tar had an 8 GB per-file limit, though that was overcome a decade ago or so, I believe). Anyway, huge chunks of the world use this same combination for all of their storage, and you will have better luck with tar+gzip or tar+bzip2 than with zip for large files for one big reason: the classic zip format does not support files or archives of 4 GiB or more (the Zip64 extension does, but I’m not sure how widely supported that is).

The long and short of this is that before you rule something in or out, you should test it. If creation of the tar file works, then extraction later should also work, barring corruption that happened in the meantime (which is obviously not the fault of ‘tar’, since the data are, or should be, at rest). Keeping a checksum of the data from before/after is a good way to see whether anything has changed. You can run this test at any time to see whether tar+gzip can handle your data, since creating the archive and then immediately extracting it proves the technology one way or the other quickly. You can even do it without taking any disk space:

tar -czvf - /path/to/archive | tar -tzvf -
echo $?

If the last line printed is ‘0’ then all was well (strictly speaking, ‘$?’ only reflects the second ‘tar’; see the stricter variant below). The above commands create an archive but pipe the output directly to a second ‘tar’ command that decompresses and lists the contents (without writing data anywhere… just reading everything from disk and then basically throwing it away while testing it).
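
A stricter variant, assuming a bash shell (the path is a placeholder), checks the exit status of both sides of the pipe:

tar -czf - /path/to/archive | tar -tzf - > /dev/null
echo "create=${PIPESTATUS[0]} verify=${PIPESTATUS[1]}"

If both numbers are ‘0’, creation and the test read both succeeded; ‘set -o pipefail’ gives the same guarantee in scripts.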

> The tar+gzip process is very convenient to save space and archive
> folders, that’s why I used it. Probably by “Why archive?” you mean
> that I could have used

The person asking probably meant: why create an archive in the first place? The downside of any kind of archive is that a single random one-time hardware problem (again, that is my vote in your case, since you seem to know what you’re doing with the tar command in general) can corrupt the entire archive. It happens… look at all of the download failures people have with the openSUSE ISOs, for some reason, on otherwise-reliable Internet connections; bigger just makes it more obvious, which is why checksums exist. Storing files individually removes the risk of a single byte affecting gigabytes of data. It’s a tradeoff, but one that I make as well (rsync backs up my stuff… no archiving with ‘tar’, because that just means more work to create/extract the archive).

> zip -r …

Huge files are not the strength of the ‘zip’ format. Chances are that most of your sensitive data stored online somewhere is being saved with ‘tar’ rather than ‘zip’; it’s worth going with ‘tar’ if possible, because its track record is just that good.

Good luck.

On 2011-12-07 06:36, ZStefan wrote:
>
> I have encountered this twice, and want to know if you also have
> experienced data loss because of zip+tar.
>
> I created a huge tgz file with
> tar -f my.tgz -v -c -z …

That’s not “zip”, but gzip. Different programs.

>
> The purpose was backup, and the file was around 50 GB.

A lot.

> It is surprising that such an old and tested software can fail
> (perhaps).

It is known. Gzipped tar backups have a single point of failure: one decompression error renders the rest of the archive unreadable.

The procedure could be improved with a check step: compare the backup with
the original before saying “done”.
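
For instance, GNU tar can compare a freshly created archive against the files on disk (the names here are placeholders; run both commands from the same directory):

tar -czf my.tgz somedir
tar -dzf my.tgz

The ‘-d’ (‘--compare’) option reads the whole archive back and reports any member that differs from, or is missing on, the filesystem.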


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 2011-12-07 14:56, ab wrote:
> Huge files is not the strength of the ‘zip’ format. Chances are that
> most of your sensitive data out online somewhere are being saved with
> ‘tar’ more than ‘zip’; it’s worth going with ‘tar’ if possible because
> its track record is just that good.

Good backup/archiving software should have an integrated forward-recovery method, i.e. some amount of redundancy in the data so that errors can be recovered from. A tgz doesn’t have this. A plain zip is better in that respect, because a failure doesn’t corrupt the entire archive, just one file. Another method is to compress the files first, then use tar or cpio. The rar format does have error recovery, but it is commercial.
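
One way to bolt such redundancy onto an existing tgz, assuming the par2cmdline tool is installed (the filename is a placeholder):

par2 create -r10 my.tgz.par2 my.tgz   # store about 10% recovery data alongside the archive
par2 verify my.tgz.par2               # later: check whether the archive has been damaged
par2 repair my.tgz.par2               # attempt to reconstruct it if it has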


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

At one time I was involved with a system that had 400 Linux servers. On each server every static file was checksummed by AIDE to detect intruders. Occasionally, a server would report an unexpected file change - it was never an intrusion; it was always bad RAM or a bad disk.

For SATA/IDE, rewriting a bad block would normally cause the drive’s firmware to map it out and replace it with a good one. You should check the SMART readouts on your drives for unrecoverable errors.
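
For example, assuming the smartmontools package is installed (the device name is a placeholder; run as root):

smartctl -H /dev/sda                                              # overall health self-assessment
smartctl -A /dev/sda | grep -Ei 'reallocated|pending|uncorrect'   # sector-level counters

Non-zero raw values for Reallocated_Sector_Ct, Current_Pending_Sector or Offline_Uncorrectable are worth investigating.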

Perhaps verify backups immediately after creating them (I always used to do this with tapes, for some reason I’ve lost the habit now that I use disk). You could also store a checksum to track whether they’ve been corrupted after the fact.
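
For instance (filenames are placeholders):

sha256sum my.tgz > my.tgz.sha256     # record the checksum at backup time
sha256sum -c my.tgz.sha256           # later: prints OK or FAILED

If the later check fails even though the archive verified fine at creation time, the corruption happened at rest rather than during archiving.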

You could run a RAM test, but I’ve never had a great deal of success with them. Maybe it would be better to loop doing backups and verifies. Are you sure which machine the corruption is occurring on?

As others have suggested, tar.gz is not a great way to back up - a bad block prevents access to the rest of the archive. Now that disk is cheap, I tend to just rsync to more than one removable disk (there are attempts at Apple Time Machine-style backups for Linux - I haven’t tried any of them, though).
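
Something along these lines, with all paths being placeholders for wherever the removable disks are mounted:

rsync -aHv --delete /home/me/ /mnt/backup1/home-me/
rsync -aHv --delete /home/me/ /mnt/backup2/home-me/

‘-a’ preserves permissions, times and symlinks, ‘-H’ preserves hard links, and ‘--delete’ keeps the copy an exact mirror; each file stays individually readable, so one bad block costs one file rather than the whole set.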

On 2011-12-08 05:46, mchnz wrote:
> At one time I was involved with a system that had 400 Linux servers. On
> each server every static file was checksummed by AIDE to detect
> intruders. Occasionally, a server would report a file had changed that
> was unexpected - it was never an intrusion, it was always bad RAM or
> disk.

It doesn’t need to be bad hardware; it can be chance. Cosmic rays, for instance, flipping a bit.

Years ago, in MS-DOS, one of the options (the VERIFY command) was to enable verify mode, so that whatever you wrote to disk was always checked (if enabled). It was understood that writes could fail. Now we can’t do that.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Carlos E. R. wrote:
> Years ago, in MsDOS, one of the options was to enable verify mode whenever
> you wrote anything to disk. Writes were verified always (if enabled). It
> was understood that writes could fail. Now we can’t do that.

Some enterprise disks include read-after-write (RAW) verification.

It happened twice: two years ago and half a year ago. One was an average-powered PC, the other an old PC. Neither has shown file errors before or after the gzip failure.

I have checked the hard drives on which the failures occurred: first by copying huge test files, then with badblocks in non-destructive mode, then with badblocks in destructive mode. No errors.

I cannot remember the version of tar, but it was the one supplied by (approximately) openSUSE 11.0 and by openSUSE 11.3.

I don’t remember the error message, but it looked like it came from gzip rather than from tar.

Thanks for mentioning that you regularly gzip large files and that it does not fail.

I want to know what the experience of other tar and gzip users is.

Both were average-powered PCs.

I understand the need to check right after archiving, and also that a single bit error will cause a large loss.

All I want to know is this: has anyone encountered a situation where tar+gzip archiving failed with large files and the likely cause was a failure in tar or gzip itself, other explanations seeming less likely?

On 2011-12-18 09:56, ZStefan wrote:
>
> I understand the need to check right after archiving, and also that a
> single bit error will cause a large loss.

As far as I know, it is a random write error, not an error you can
reproduce by creating the same archive a minute later. It is a known
problem with tgz archives.

The only solution if you use this format is to verify the archive later.
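
For instance (the filename is a placeholder):

gzip -tv my.tgz                 # tests only the gzip compression layer
tar -tzf my.tgz > /dev/null     # also walks the tar structure inside it

Either command exiting non-zero means the archive is already damaged.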


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

If you google, you will see some people having problems with 2 GB/4 GB files depending on their platform (documented on the gzip home page), and people having problems with FTP transfers in text mode.

I had heard that at least at one time in the past (maybe still?) gzip was non-deterministic - you could get two different compressed outputs from the same input, but both would be valid and would decompress just fine.

Perhaps use something else for a while (xz, bzip2, zip, cpio) and see if the problem goes away.
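
If you switch to xz or bzip2 with GNU tar, only the compression switch changes (the filenames and path are placeholders):

tar -cJvf my.tar.xz /path/to/data    # xz instead of gzip
tar -cjvf my.tar.bz2 /path/to/data   # bzip2 instead of gzip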