Time consumption for identifying the changes for a mirror update

This may sound like a strange question, so let me explain:

I'm still in the process of migrating from Windows to Linux.

On Windows I'm working with 8 TB of data, where changes can occur anywhere.
These changes typically amount to a few hundred files (about 20 GB) per week.
I keep a mirror that is updated once a week.

When I do the mirror update (exchanging the changed files) with an out-of-the-box program,
it checks EVERY file (timestamp, size).
That takes about 2 hours for 8 TB (USB 3).

This isn’t very clever, and has many drawbacks.

So, inspired by Apple's "Time Machine", I wrote a program
that keeps track of every file change within the 8 TB as I work during the week.
When the weekly update comes due, my program just writes the few hundred entries
it has kept. Typically that takes 10 minutes. Cool, isn't it? :slight_smile:

As I'm still a newbie at Linux, I was told that a simple script of rsyncs
would do the job well and quickly.
But… to tell the truth, it's hard for me to believe that Linux should
be so much faster than Windows when doing it the "stupid way" (checking every file).

If not, I would try to port my program to Linux (GUI, system calls etc.);
maybe rewrite it from scratch.

So that was the explanation. :slight_smile:

So would somebody please tell me roughly how long the rsync way would take
to do the described task with that data?

Moreover, are there other aspects I'm not aware of?

As you can pass countless parameters/options to rsync to optimize different things, a general statement about time consumption can't be made. Additionally, it depends heavily on your hardware setup. You need to test it yourself on your hardware.

I can't test it (otherwise I wouldn't have asked) - everything is still on Windows.

It's hard to believe that the hardware has so much influence; surely it's the method that matters.

Depends on the rsync options used.

I’ve compared rsync to Win’s robocopy.

rsync outperforms robocopy.

There is NO way anyone can tell you the time it would take. Too many variables are involved: the options used, file sizes, file types, directory structure, the hardware, and so on.

If you're going to keep doing backups with rsync… with the proper settings, after the initial backup it will only copy files that have changed since then.
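
For reference, a minimal sketch of what such an invocation could look like (the paths are placeholders, not from this thread):

# One-way mirror: copy new/changed files, delete files that vanished from the source.
# -a = archive mode (recursive, preserves times/perms/ownership), -v = verbose.
rsync -av --delete /data/ /mnt/backup/data/

The trailing slash on the source means "the contents of /data"; without --delete, files removed from the source would accumulate on the mirror.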

After some more thinking… :slight_smile: :
Actually I can do the comparison on Windows myself:
just mount the two NTFS partitions (source and dest) and rsync.
When the next sync becomes due… I'll post the result.
I bet rsync takes the same 2 hours as the out-of-the-box apps.

Thank you for your support… that brought me to the idea. :slight_smile:

It’s not the copy process that makes the difference, but the traversing.

Yeah, maybe the confusing part (for me) is the "still migrating" part.

According to a post of yours (below) about a successful openSUSE install… that was a year ago!

My guess is you’re still using Windows and don’t intend to migrate anytime soon.

One thing I think is important to consider, performance-wise.

Doing an rsync from NTFS to NTFS will never compare to, say, rsyncing native Linux filesystem types. To add, rsyncing/copying NTFS to NTFS is not "migrating" to an all-native Linux (file) system.

As mentioned by another user, rsyncing a native Linux filesystem will run circles around doing the same with NTFS.

XFS (which we use for the /home filesystem) outperforms NTFS when used with rsync, especially for large file operations or when dealing with many small files. NTFS, while functional under Linux, is a Windows filesystem, and the driver layer adds overhead, particularly with kernel calls. This leads to slower transfer speeds. Even ext4 is faster.

Damn!
That was a fear that came to me too.

So we're back at the starting point:
I cannot do the comparison myself,
and need somebody to tell me how long an rsync run
takes for a large amount of data (with a relatively small part to copy),
so I know whether it's worthwhile to port my Windows app to Linux.

I don’t understand that. Could you please explain?

Why on earth do you come to that thought?
Right, about a year ago I succeeded in installing openSUSE;
and since then I've been working out how to migrate the special software running on the old Win7 system -
when I have the time, which is anything but full-time.
And I don't see what this has to do with the question I'm asking for help with.
You tell me.
(BTW, in one of your posts a few months ago you made a misspelling… That proves everything!)

Backup runs from yesterday, hard disk to USB portable drive.
See the lines marked "<====== Note" for timing.

Example 1 RSYNC linux to linux (Scan only, Nothing to do)

/dev/mapper/array2      ext4      917G  229G  642G  27% /home/Userid/Mount_Array1

cbackup: /home/Userid/Mount_Array1/Data/newsdc1/ directory Found - Proceeding!!       
cbackup: /home/Userid/Mount_Media/Data/newsdc1 directory Found - Proceeding!!       
cbackup: This run will synchronize [/home/Userid/Mount_Array1/Data/newsdc1/] to [/home/Userid/Mount_Media/Data/newsdc1]       
cbackup: rsync backup From [/home/Userid/Mount_Array1/Data/newsdc1/] To [/home/Userid/Mount_Media/Data/newsdc1]       
cbackup: Results listing will be [250629_101350_cbackup_newsdc1.txt]       
cbackup: THERE ARE SPECIFICALLY EXCLUDED FILES       
/home/Userid/backup.ctl/cbackup.excludes
cbackup: ===================================       
cbackup: RSYNC Parameters are:        
/home/Userid/bin/cbackup.parm
cbackup: ===================================       
cbackup: Proceeding....       
cbackup: ===================================       
cbackup: Begins at Sun 29 Jun 10:13:50 MST 2025       <====== Note
sudo rsync  --exclude-from=/home/Userid/backup.ctl/cbackup.excludes --human-readable --recursive --perms --times --group --owner --delete-delay --delete-excluded --info=BACKUP,COPY,DEL,FLIST2,MISC2,NAME,PROGRESS,REMOVE,SKIP,STATS2,SYMSAFE /home/Userid/Mount_Array1/Data/newsdc1/ /home/Userid/Mount_Media/Data/newsdc1       
cbackup: Ends at Sun 29 Jun 10:15:06 MST 2025       <====== Note       
cbackup: Total Wall Time 0 Days, 0 Hours, 1 Mins, 16 Secs       <====== Note
cbackup: ===================================       
cbackup: ====== RSYNC Output Follows =======       
cbackup: ===================================       
sending incremental file list

Number of files: 249,124 (reg: 244,294, dir: 4,830)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 0
Total file size: 141.15G bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.025 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 4.55M
Total bytes received: 5.79K

sent 4.55M bytes  received 5.79K bytes  60.29K bytes/sec
total size is 141.15G  speedup is 31,006.91

Example 2 RSYNC linux to linux

/dev/mapper/bulk1       ext4      917G  782G  135G  86% /home/Userid/.Base-Bulk

cbackup: /home/Userid/Bulk/ directory Found - Proceeding!!       
cbackup: /home/Userid/Mount_Media/Data/Userid-bulk directory Found - Proceeding!!       
cbackup: This run will synchronize [/home/Userid/Bulk/] to [/home/Userid/Mount_Media/Data/Userid-bulk]       
cbackup: rsync backup From [/home/Userid/Bulk/] To [/home/Userid/Mount_Media/Data/Userid-bulk]       
cbackup: Results listing will be [250629_101304_cbackup_Bulk.txt]       
cbackup: THERE ARE SPECIFICALLY EXCLUDED FILES       
/home/Userid/backup.ctl/cbackup.excludes
cbackup: ===================================       
cbackup: RSYNC Parameters are:        
/home/Userid/bin/cbackup.parm
cbackup: ===================================       
cbackup: Proceeding....       
cbackup: ===================================       
cbackup: Begins at Sun 29 Jun 10:13:04 MST 2025        <====== Note 
sudo rsync  --exclude-from=/home/Userid/backup.ctl/cbackup.excludes --human-readable --recursive --perms --times --group --owner --delete-delay --delete-excluded --info=BACKUP,COPY,DEL,FLIST2,MISC2,NAME,PROGRESS,REMOVE,SKIP,STATS2,SYMSAFE /home/Userid/Bulk/ /home/Userid/Mount_Media/Data/Userid-bulk       
cbackup: Ends at Sun 29 Jun 10:13:36 MST 2025       <====== Note 
cbackup: Total Wall Time 0 Days, 0 Hours, 0 Mins, 32 Secs       <====== Note 
cbackup: ===================================       
cbackup: ====== RSYNC Output Follows =======       
cbackup: ===================================       
sending incremental file list
=======> Redacted file list <===========
Number of files: 72,541 (reg: 67,394, dir: 5,134, link: 13)
Number of created files: 590 (reg: 504, dir: 86)
Number of deleted files: 304 (reg: 249, dir: 55)
Number of regular files transferred: 1,775
Total file size: 477.38G bytes
Total transferred file size: 739.50M bytes
Literal data: 739.50M bytes
Matched data: 0 bytes
File list size: 917.41K
File list generation time: 0.208 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 741.74M
Total bytes received: 41.83K

sent 741.74M bytes  received 41.83K bytes  22.82M bytes/sec
total size is 477.38G  speedup is 643.56


Thank you very much for investigating and providing that valuable information! :slight_smile:

For my question the first, scan-only run is the important one,
as it shows the time for the traversal/comparison alone.
(The copy time of the 2nd example isn't important in this case, as I suspect
there won't be much difference between Linux and Windows.)

Your config:
If I get it right, it takes 76 s for about 250,000 files/dirs, right?
==> about 3,300 files/s

My out-of-the-box app on Windows
checks 320,000 files in (as I mentioned) 2 h (!!!) (NTFS to NTFS via USB 3).
==> about 44 files/s.

Wow!

I don’t know how Linux/ rsync does that (magic probably)…

but this settles it:

There is no need to recode my “Delayed sync” solution (described above).

rsync is absolutely sufficient.

It depends on so many factors, nobody else’s system - unless it’s configured identically to yours - is going to be of any use in determining what you should expect.

  • Are you running one I/O bus between drives, or are there separate busses?
  • How fast are those busses?
  • What kind of storage devices are you using? SSDs? Hard drives? If hard drives, what’s the rated data transfer rate?
  • How fragmented are the files on the storage device? (Sequential reads will go faster than non-sequential reads)
  • What filesystems are in use? (Yes, this does make a difference - how the filesystem stores the data and metadata will matter with a large number of files.)
  • With rsync, data that is already on the destination can be checked in a couple of different ways: as I recall, you can tell it to calculate checksums, which takes more time but is more accurate for duplication detection, or you can use date/size comparisons to determine whether a file has changed, which is much faster but may be less reliable (see the sketch after this list). Data isn't copied if it doesn't need to be.
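
To make that last point concrete, a minimal sketch of the two modes (placeholder paths):

# Default "quick check": skip any file whose size and modification time match.
rsync -a /src/ /dst/
# Force full checksums: opens and reads every file on both sides - thorough but slow.
rsync -a --checksum /src/ /dst/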

Compression (either in-transit or in storage) can be a big factor as well, and the CPU will determine how fast the compression algorithm runs - but which compression algorithm is in use will also play heavily into performance.

rsync is generally a good solution for keeping data in sync, and it’s been developed for years to be highly efficient.

I know very well there is big uncertainty when the test population is only 2.

And so 3,300 / 44 could very well be just chance, or a "bad" distribution of the factors you mentioned.

But this test is all we have now - and I tend to believe it's a fair assessment of the influence of Linux, ext4 and rsync vs. Windows and the Win API.

When evaluating two different things, it’s important to understand the nature of the things being compared.

It isn’t always an apples-to-apples comparison, and raw numbers alone aren’t going to be sufficient to determine which is “better” or “worse”.

So yeah, you can get raw numbers, and as long as you’re aware that those raw numbers have different meanings in different situations, then at least you’re operating with an open mind about what it is that’s being measured.

Otherwise, you might as well compare read performance to write performance, and ignore the impact that in-device (or in-OS) caching has on reads but not on writes (for example).

The test population could be a thousand, and you’d still not have a great set of data without accounting for the variables in system configurations. I spend a fair amount of time dealing with statistical analysis, and while the statisticians I work with often say “more data is better”, when I point out that they actually mean “more good data is better”, they concur that garbage data isn’t useful at all when doing an analysis.

To avoid getting 'garbage data', it's important to understand what you're measuring - not just at the macro level (file copy performance, for example), but at the micro level (all the factors that impact disk I/O performance in general).

It's like looking at a machine with a high-speed CPU and wondering why the system is crawling with no CPU utilization (usually it's an I/O bottleneck putting the CPU in a state where it waits for I/O to complete before it can continue).

Backup with btrbk to external SSD: root (13G/383,956 files) and home (572G/1,222,871 files):

erlangen:~ # journalctl -b -u btrbk-Crucial.service --identifier systemd
Jul 01 05:01:41 erlangen systemd[1]: Starting btrbk backup of /home to SSD...
Jul 01 05:02:34 erlangen systemd[1]: btrbk-Crucial.service: Deactivated successfully.
Jul 01 05:02:34 erlangen systemd[1]: Finished btrbk backup of /home to SSD.
Jul 01 05:02:34 erlangen systemd[1]: btrbk-Crucial.service: Consumed 30.265s CPU time.
erlangen:~ # 

Transferred 1.70 GiB (root) and 20.59 GiB (home) in 53 seconds.

BTW: Same size backup to internal SSD in 18 seconds.
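
For context: btrbk automates btrfs snapshots plus incremental send/receive. Stripped of btrbk's management layer, the underlying mechanism looks roughly like this (snapshot names and paths are illustrative, not taken from this setup):

# Take a cheap, read-only, copy-on-write snapshot of the subvolume.
btrfs subvolume snapshot -r /home /home/.snapshots/home.new
# Ship only the extents changed since the previous snapshot; no per-file
# traversal or timestamp comparison is involved.
btrfs send -p /home/.snapshots/home.old /home/.snapshots/home.new | btrfs receive /mnt/usb/backups/

That is why the scan phase effectively disappears: the filesystem itself already knows what changed.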

The big win with rsync is that, by default, it does not compare the contents of files that have the same datetime-modified and size. Only if one of those differs will it go on to checksum the files, which entails actually opening and completely reading both files.

If you can trust the datetime-modified and size, then that's a perfectly fine thing to do. I don't know whether this has been fixed, but recently some updates of Tumbleweed packages no longer altered the datetime-modified. File sizes from minor updates would often still match, so I wound up with an rsync-based backup that was not consistent. For OS backups I therefore now force rsync to checksum everything; that takes quite a bit longer, but as the OS and the backup are both on NVMe and the OS is relatively small, it's not a big deal.

I’ve been using rsync to backup /home for decades. There has never been a similar issue with /home, and on the occasions where I have bothered to force a full checksum, no anomalies have ever surfaced.

Because of the default behavior of rsync (only inspecting files with a different datetime-modified or size), I imagine it should be comparable in speed to a backup that works from a list of what's changed. However, I suppose the directory structure and number of files could make a difference.

rsync has a lot of options that may help achieve decent speeds. For example, if you are using SMR-based hard drives, it's possible that writing too fast will overfill the drive's cache buffers, which can cause SMR drives to slow right down. When backing up to an external 4TB Seagate SMR-based drive I found it best to run rsync with a write limit of 50M to keep the write speed tolerable (rsync --bwlimit=50M).
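
Spelled out, that might look like this (the mount point is a placeholder):

# Throttle transfer to about 50 MiB/s so the SMR drive's cache is never overrun.
rsync -a --bwlimit=50M /home/ /mnt/seagate-smr/home/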

Well, it's me who is the newbie (at Linux), but I would not say

but:
By default rsync compares files by size and date (like most apps do by default).

I do not agree with that either.

The first case compares ALL files.
The second case does no comparing at all; it just copies the previously logged files.
So I would expect the latter method to be much faster than the first.
But judging by the examples some members kindly provided,
it seems that rsync (and btrbk probably even more so) does it so quickly
that the logging method simply isn't necessary.
For special usage it might be, however…

Maybe fast-working rsync is the reason why there is no app for Linux
that uses the logging method.
As I mentioned, for Apple there's Time Machine, and for Windows
only a few of the backup/mirror apps provide that special logging mode,
and always in a very restricted way. The app I wrote about 20 years ago
(I called it "DelaSync"… delayed sync :wink: ) is completely outdated,
has some severe shortcomings, uses MFC, and I can't maintain it anymore.

Once again I would like to thank all supporting members.
I read all posts carefully and appreciate them!

Most commonly used Linux file tools, such as cp, scp, or tar, just copy the whole file no matter what. rsync does nothing for files with the same datetime-modified and size. So, at least as far as Linux goes, it is not like the commonly used tools.

A backup tool that knows in advance what to copy should be faster. But by how much? Typically, when writing to slow external media, the copy/write operations are going to dominate. So if rsync only copies the files with differing datetimes and sizes (and isn't checksumming all files), it may well do a similar amount of writing and land in a similar ballpark. I did say the result may depend on other factors - for example, because it can be throttled to match SMR drive limits, rsync might actually wind up being faster.

If you look for them, various time-machine-like backup systems do exist for Linux. I'm not really familiar with any of them. You should also consider that btrfs has some time-machine-like capabilities, but I don't use btrfs; perhaps others will comment.

We’re misunderstanding.

As I already stated in earlier posts above, it's not the copy time that makes the difference between an "a priori logging" app and a "typical" sync tool or rsync.
The expected difference comes from the fact that a logging app doesn't have to traverse the complete data set to identify (by file size and date) what to copy.
There is no need for traversal or comparison at all.

As far as I know there is no tool for Linux that does the "a priori logging".
(Probably because rsync, btrbk etc. are so damned quick that it's unnecessary
for the typical case, as I wrote already.)

Agreed, we are talking past each other.

When you write "time", I think total elapsed time. While omitting the traversal might save some amount of time, the saving might not be significant if the backup time is dominated by writes (which it may well be if the destination is an external SMR drive). Maybe I save a five-minute traversal but still see 20 minutes of writes - so rsync may be good enough, and that's probably why it's so often used: because it's good enough.

Agreed: if not dominated by write-limited hardware, working from a log is going to be fastest - but how do you get such a log for /home, where arbitrary software creates arbitrary files?

Note also that across networks, rsync only copies the parts of the files that have changed, which can also greatly reduce writes. (It can be forced to do this locally too, but only at increased risk of damage if terminated while writing.)
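
A sketch of forcing that locally, with placeholder paths:

# Local copies default to --whole-file; --no-whole-file re-enables the delta
# algorithm, and --inplace writes changed blocks directly into the destination
# file - fewer writes, but a half-updated file if interrupted.
rsync -a --no-whole-file --inplace /src/ /dst/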

There are many tools; they are just not as commonly used as rsync. If you search for btrfs timemachine, you will find several that take advantage of btrfs's logging. In the past I've read of some that do not require btrfs as well, but they are probably not as efficient.

"Filesystem journal" - that's the buzzword!

I completely forgot about that, as my old Windows app was for FAT variants
(which don't offer journaling) and NTFS (where it's disabled by default).

If rsync uses filesystem journals/logging, then it's no surprise
that it is so quick at identifying changes.

BTW:
Meanwhile it has become clear that I chose the wrong title for this thread
at first. It should have been something like:
"Time consumption for identifying the changes for a mirror update".
Please excuse me - being a newbie, this sort of thing probably happens.

No, rsync does not use the journals/logging. rsync is general-purpose and doesn't require a specific filesystem.

There are other tools available that are built to work only with the btrfs filesystem, and they implement Time-Machine-like functionality. Btrfs isn't a journaling filesystem; it's based on copy-on-write, which lends itself to keeping generations of a file and providing snapshots over time.

I think there are also backup utilities that take advantage of the kernel's inotify and fsnotify file-system monitoring to track file changes. Those tools would likely also be filesystem-neutral.
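
As an illustration of that idea, a rough sketch using the inotify-tools package (not any particular backup utility; all paths are made up):

# Record every path touched under /data into a change log, continuously.
inotifywait -m -r -e modify,create,delete,move --format '%w%f' /data >> /tmp/changed.log
# At backup time: deduplicate, strip the source prefix, hand the list to rsync.
sort -u /tmp/changed.log | sed 's|^/data/||' > /tmp/changed.uniq
rsync -a --files-from=/tmp/changed.uniq /data/ /mnt/backup/data/

Caveats: the watcher must run the whole time (inotify watches don't survive a reboot), recursive watches on millions of files run into the kernel's watch limits, and deleted files appear in the log as paths that no longer exist.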

The main reason I continue to use rsync is that I'm used to it, and it has proved pretty reliable over the decades. There are probably faster and better options, but the time taken by rsync has not bothered me enough to look elsewhere. If I were to change, it would require a considerable investment of time to re-script my online, offline and offsite backups. If you're starting from scratch, it would be well worth looking around.