Rsync vs Tar/Gz for backups

Dear all,
I have the following backup script that packs my current folders' contents onto an external hard disk (in my company we have enough hard disks, so there is no real need to do incremental backups):

# Log a timestamped start marker, then archive /etc, /root and /home
echo '----- Backup Started '`date` >> /root/backup/backuperrors.txt
tar -zcvf /media/a9f299d7-fcbc28b3f3c0/user-host`date '+%d-%B-%Y'`.tar.gz /etc /root /home 2>> /root/backup/backuperrors.txt

I would then like to ask you:
a. Is there any problem with tar/gz at large sizes? Let's say the created output file is 500 GB; can tar/gz have limitations on the file size they can handle?

b. Might it be better to use rsync,
with the -c flag to do checksum checks while “copying”
and the -z flag to compress the files?

c. How does a backup with rsync work with the -z flag? Does it create a single compressed file like tar.gz, or does it compress file by file?

d. Are there any better ways to use rsync for compressed backups?

I would like to thank you in advance for your help.

Best Regards
Alex

I think you should first try to understand what rsync does. It is rather different from tar. And when you say

c. How does a backup with rsync work with the -z flag? Does it create a single compressed file like tar.gz, or does it compress file by file?

my guess is that you did not understand what it says in the rsync man page:

-z, --compress compress file data during the transfer

Which means, IMHO, that no compressed files are created; compression happens during the transfer (useful, because rsync often runs between different systems, and compressing over a network may bring better performance).
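
For illustration, a minimal sketch of such a transfer (the host name and paths here are made up):

# compress data on the wire; files arrive uncompressed on the other side
rsync -avz /home/ backuphost:/backup/home/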

In short, tar (tape archiver, but almost nobody nowadays uses tape as the place to put the archive) bundles several files into one big file (in earlier times, on a tape).

rsync/rsyncd (you need both) synchronises files (complete directory trees) between places, often on different systems. Think e.g. of having an address book on different systems, where you changed one and want the other synced with it. But also think of backup (I, e.g., rsync the complete /home of one system to a backup place on another system). The gain of using rsync is that only changed files (or even parts of them) are transferred. Another gain is that you can walk through the directory tree on the backup system and find everything in the same place for inspection and/or restore.
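
As a rough illustration of that kind of sync (the destination path is only an example):

# copy only new or changed files; --delete also removes files on the
# backup that no longer exist in the source
rsync -av --delete /home/ /backup/home/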

hcvv wrote:
> rsync/rsyncd (you need both)

my guess is that you did not understand what it says in the rsync
man page :-)

rsyncd is the rsync daemon and you only need it if you want to use it.
rsync will work perfectly happily without it.
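
For illustration, the two transport modes look like this (the host names are placeholders):

# over ssh -- no rsyncd needed on the remote side
rsync -av /home/ user@backuphost:/backup/home/
# against a running rsync daemon (note the double-colon module syntax)
rsync -av /home/ backuphost::backupmodule/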

But that niggle apart, everything else was good advice.

You are right. I was too hasty, most probably tempted by the fact that rsync also does checksum checks on the files it handles (which I thought might be a very useful feature when creating large compressed tar files)

I would like to add one more point to this nice discussion.
I was looking at hard disk costs and found that they are quite low these days (in other words, not too expensive if you think about how crucial my data are).

So the question then is:
What if I have an external 3 TB hard disk and run rsync once per week (say, on Sundays), rsyncing my system to the external hard disk? If I am not wrong, this will create a cloned version of my system on the external hard disk. Would that not be true?

B.R
Alex

Currently, I am using “dar”, which you would have to install from the repos.

I have been persuaded that it is better, though I never ran into problems with “tar”.

The differences:

1: “dar” compresses files individually in the archive. This is supposed to ensure that a bad disk sector in the archive will only affect one file, instead of having effects that leak into all subsequent files.

2: “dar” creates a multi-file archive, as needed, for large archives.

The main disadvantage of “dar” - the command is not on the usual rescue CD or DVD media, so you either need to build a special-purpose CD, or figure on installing the system before you can recover the backed-up data. I use the second of those choices - I only back up “/home” anyway.
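
For reference, a minimal dar invocation along those lines might look like this (the paths and the 1 GB slice size are just example values):

# create a gzip-compressed archive of /home, split into 1 GB slices
dar -c /backup/home-backup -R /home -z -s 1G

Each slice is written as home-backup.<number>.dar, and because files are compressed individually, damage to one slice should only affect the files stored in it.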

alaios wrote:
> What if I have an external 3 TB hard disk and run rsync once per week
> (say, on Sundays), rsyncing my system to the external hard disk? If I
> am not wrong, this will create a cloned version of my system on the
> external hard disk. Would that not be true?

Yes, that will work. You can even use different disks on alternate weeks
or whatever to get more than one backup. It’s also possible to use rsync
to create many incremental backups on one disk. I use a program called
dirvish - http://www.dirvish.org/
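
As a sketch of such a weekly mirror (the mount point and the exclude list are assumptions for a typical setup):

#!/bin/bash
# weekly-mirror.sh -- run e.g. from cron every Sunday night
rsync -aAXv --delete \
  --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' \
  --exclude='/run/*' --exclude='/tmp/*' --exclude='/media/*' \
  / /media/backupdisk/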

I like (and use) the method that is automated and configurable by ‘rsnapshot’ (http://www.rsnapshot.org/)

In fact I reprogrammed it myself, but the idea is that you make an rsync copy and then use cp -al to create a new generation of the backup (say backup0 to backup1) that uses hardlinks and thus almost no space. A new rsync then synchronises backup0 (only changed files are copied and removed files deleted), while at the same time backup1 is still complete.

You can thus create cycles. E.g. I have a cycle of 10 and run a backup each week, which means that you can retrieve files up to 10 weeks back in time. But it is easy (and that is where rsnapshot helps you) to create several cycles, e.g. daily, weekly, etc.
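
A bare-bones sketch of one such rotation step (the directory names are just examples):

#!/bin/bash
# shift the generations: backup.1 -> backup.2, then hardlink-copy backup.0 -> backup.1
rm -rf /backup/backup.2
[ -d /backup/backup.1 ] && mv /backup/backup.1 /backup/backup.2
cp -al /backup/backup.0 /backup/backup.1
# resync backup.0 against the live data; unchanged files stay hardlinked
rsync -a --delete /home/ /backup/backup.0/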

(And yes, rsyncd is not needed in all circumstances, but it helps when you use different systems ;-) )

Hi. I think your answer is even better.

So I will go purchase the external hard disk and then ask for some help configuring it.
How easy is it to recover a specific cycle? And how easily can it fail, so that after one year of operation the cycles cannot be recovered? (In that extreme case I would guess that only the first backup could be restored… which was done one year before.)

I am ordering the hard disk today and I will come back for more help tomorrow or at the weekend.

Regards
A

hcvv wrote:
> I like (and use) the method that is automated and configurable by
> ‘rsnapshot’ (http://www.rsnapshot.org/)

Yes, rsnapshot is an alternative to dirvish. There are more (not sure
whether that is a :-) or a :-( )

> In fact I reprogrammed it myself, but the idea is that you make an
> rsync copy and then use cp -al to create a new generation of the
> backup (say backup0 to backup1) that uses hardlinks and thus almost no
> space. A new rsync then synchronises backup0 (only changed files are
> copied and removed files deleted), while at the same time backup1 is
> still complete.

That’s how dirvish works, except it uses rsync to do everything instead
of using cp for part of the work. I’m not saying there’s anything wrong
with using cp, BTW.
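
For comparison, rsync can produce the hardlinked generation itself via --link-dest (a sketch; the paths are examples):

# build the new snapshot directly; files unchanged relative to backup.1
# become hardlinks into backup.1 instead of fresh copies
rsync -a --delete --link-dest=/backup/backup.1 /home/ /backup/backup.0/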

> You can thus create cycles. E.g. I have a cycle of 10 and run a backup
> each week, which means that you can retrieve files up to 10 weeks back
> in time. But it is easy (and that is where rsnapshot helps you) to
> create several cycles, e.g. daily, weekly, etc.

Yes, and you can have different cycles for different parts of your
filesystems etc etc.

alaios wrote:
> Hi. I think your answer is even better.
>
> So I will go purchase the external hard disk and then ask for some
> help configuring it.
> How easy is it to recover a specific cycle? And how easily can it
> fail, so that after one year of operation the cycles cannot be
> recovered? (In that extreme case I would guess that only the first
> backup could be restored… which was done one year before.)

All the cycles are just directories in the filesystem on the backup
device, so if I want a particular file from two days ago I’ll get it
from some address like:

/backup/computer1/2012-09-11/tree/etc/init.d/rsyncd

> I am ordering the hard disk today and I will come back for more help
> tomorrow or at the weekend.

Sounds like a plan. I for one won’t be here at the weekend, but I’ll be
back on Monday.

On 2012-09-13 12:56, alaios wrote:
>
> You are right. I was too hasty, most probably tempted by the fact
> that rsync also does checksum checks on the files it handles (which
> I thought might be a very useful feature when creating large
> compressed tar files)

You should read the manual again. The -c parameter does not mean that rsync verifies the checksum of what it writes, to confirm that it was written correctly. What it means is that rsync uses checksums to decide NOT to make a second copy of an unchanged file on a second run.
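
In other words (a sketch; the paths are examples):

# default: rsync decides what to copy by comparing size and modification time
rsync -av /home/ /backup/home/
# with -c it compares checksums instead -- slower, but it catches files whose
# content changed while size and mtime stayed the same
rsync -avc /home/ /backup/home/

Neither form verifies the data after writing it; that check you have to arrange yourself.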


Cheers / Saludos,

Carlos E. R.
(from 12.1 x86_64 “Asparagus” at Telcontar)

On 2012-09-13 14:58, Dave Howorth wrote:
> hcvv wrote:
>> I like (and use) the method that is automated and configurable by
>> ‘rsnapshot’ (http://www.rsnapshot.org/)
>
> Yes, rsnapshot is an alternative to dirvish. There are more (not sure
> whether that is a :-) or a :-( )

For example, there is rdiff-backup. The current backup is a mirror copy, and the previous ones are
stored as rdiffs.
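
A quick sketch of its usage (the paths are only examples):

# back up /home; older versions are kept as reverse diffs inside the target
rdiff-backup /home /backup/home
# restore a file as it was 10 days ago
rdiff-backup -r 10D /backup/home/some.file /tmp/some.file.restored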


Cheers / Saludos,

Carlos E. R.
(from 12.1 x86_64 “Asparagus” at Telcontar)

On 2012-09-13 13:16, nrickert wrote:

> I have been persuaded that it is better, though I never ran into
> problems with “tar”.

One problem and you don’t go back…

>
> The differences:
>
> 1: “dar” compresses files individually in the archive. This is
> supposed to ensure that a bad disk sector in the archive will only
> affect one file, instead of having effects that leak into all subsequent
> files.

Yes, a single error in the .tar.gz archive may render the entire archive useless.

> The main disadvantage of “dar” - the command is not on the usual rescue
> CD or DVD media, so you either need to build a special-purpose CD, or
> figure on installing the system before you can recover the backed-up
> data. I use the second of those choices - I only back up “/home” anyway.

Or you can have a dedicated small rescue partition.

There may be other alternatives, like “rar”. IIRC, it can add redundancy for error recovery.
I’m not sure how well it manages Linux filesystem permissions, though.


Cheers / Saludos,

Carlos E. R.
(from 12.1 x86_64 “Asparagus” at Telcontar)

I think we are now getting to the point where your “real” problem is. What you did is “explaining the step instead of the goal”. Your real goal is making backups, but you ask about the step “which tool”, and you already narrowed it down to very few possibilities (without really studying the documentation).

The whole subject of making backups is very complicated. IMHO you should start by asking yourself why you want to back up. What are the calamities you want to be able to recover from?
. Some want to be able to recover very quickly from a broken disk (partition), so they make an image of it (using dd or the like).
. Others want to be able to restore a document for one of their users when (s)he comes and says: “Can I have last week’s version of XYZ? I borked it a few days ago beyond recognition.” A backup at directory/file level (of which we discussed some above) is helpful here. Then comes the question of how often to back up and how long to store.
. For restoring the system (root partition), many only back up /etc and do a reinstall (for which you should have notes about what and how you installed in the beginning).

Then about the backup media:
. One big disk with 12 monthly cycles, 4 weekly ones and 7 daily ones looks nice, until the disk dies and you have nothing left.
. Leaving the backup media in the system will let it die with the rest of the system in a fire or from a high-voltage peak.
. Having the backups on another system (an older one, maybe even a text-only system, is fine) a few yards away from the production system(s), or better in another room, is already better, and of course offline is best for when the house burns down, but it may take time to fetch the offline media when you store them at a friend’s. (I have a few files I do not want to lose stored at my ISP; I have some disk space there, included in the contract, where I can put a website and/or store what I want, using WebDAV to get there; in modern speech: in the cloud.)

You should have at least a better-than-vague idea about your requirements and then go searching for (a combination of) solutions. And as always, it is a middle way between “the best” and “the affordable” (in money as well as in work).

Thanks for making it more precise :-)
(note: by rsync below I mean any rsync-based tool or rsync itself; I will probably select a tool today)

Some answers to the question:

Backup needs:
It is very important not to lose any file; there is 3 years’ worth of work that should not be lost. That means I should be able to recover the system without losing any files from the /home partition.

Right now I have the bash script I gave at the beginning, which I run manually against an external hard disk.

Rsyncing the system has something I like: a clone of each file in a new place (especially given that a 2 TB hard disk is not that expensive any more).

I am not sure whether the cycles would give me a bit more or not. I am thinking that the idea of a cycle might be helpful if you want to recover from a fatal error (or human stupidity) on your system (rm -R as root). So having 2-3 cycles might be enough for my case (with a new cycle created every Sunday).

Recovering:
Then we come to the “bad things happened” scenario: if I use rsync, I will need to clone the backup from the external hard disk to a new disk. Is that right?

I was thinking that an improvement on that might be to have all my partitions on the external hard disk and have rsync back up everything to the corresponding partition. If this is “valid”, I would in the future also be able to clone that disk to a new one and more easily get a system running. (That means I would also have the installed packages and the configuration in /etc.)

How much hard disk space should I have to take care of 2-3 cycles as well? My home directory is around 200 GB and the hard disk is 2 TB. Should that be 2 TB+ (meaning actually 3)?

Regards
Alex

With the method of rsync plus a hardlinked copy (as rsnapshot and others do), it is a bit difficult to tell how much space you need. Of course, you need at least the same space as the original. When you do a cycle each week (let us take 10 weekly cycles, just as an example to make talking about it easier) but nothing changed in that week, the increase in size will be minimal. But when you changed the contents of three 10 GB files during that week, you will need 30 GB more. When nothing changes after that, you will get those 30 GB back after 10 weeks.

With this system, your first backup will also take the most time, because everything must be copied. How long the later backups in the cycle take will depend heavily on the amount of changed files.
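
A simple way of checking how much space the snapshots really use (the directory names are examples): du counts each hardlinked file only once per invocation, so

du -sh /backup/backup.0      # size of one snapshot on its own
du -shc /backup/backup.*     # combined real usage of all snapshots

The gap between the sum of the individual sizes and the combined total is the space saved by the hardlinks.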

alaios wrote:
> How much hard disk space should I have to take care of 2-3 cycles as
> well? My home directory is around 200 GB and the hard disk is 2 TB.
> Should that be 2 TB+ (meaning actually 3)?

2 TB will be plenty to back up a 200 GB filestore. There’ll be lots of
room for other things as well, if necessary.

Whatever backup scheme you decide to employ, don’t forget to test it,
i.e. test recovery of files and/or the whole system, if that is your
goal. It’s very depressing to have a disk fail and then discover you
have forgotten something in the design of your recovery plan.
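
One low-effort check (a sketch; the paths are examples) is a checksum dry run comparing the live data against the latest snapshot:

# -n = dry run, -c = compare by checksum; anything listed differs between the two
rsync -avcn --delete /home/ /backup/backup.0/

plus, now and then, actually restoring a handful of files to a scratch directory and opening them.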

A very valid point, which I always think about… The problem then is how to check it…?
Would that mean a third hard disk, to extract all the files there?