The block size could be different, though that is unlikely.
More likely, you are dealing with a sparse file. A sparse file can have a large file size but only a small number of disk blocks. For example, if you create a new empty file, seek to the 1 MB position, and write there, you get a sparse file. Reading that file returns bytes with value zero until you reach the actual data, but no disk blocks are allocated for those zeros.
When you copy the file, that copies the zeros. So the destination file won’t be sparse and will have more blocks assigned. The “cp” command and several other commands actually have options for handling sparse files.
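To make the sparse-file behaviour concrete, here is a minimal Ruby sketch. It assumes a Linux-like system where the file system supports holes and where File::Stat#blocks counts 512-byte units; the helper name make_sparse_file and the file name sparse.bin are just for illustration.

```ruby
require "tmpdir"

# Create a file with a hole: seek past the start, then write a few bytes.
# Returns [apparent size in bytes, allocated blocks (nil if unsupported)].
def make_sparse_file(path, offset, payload)
  File.open(path, "wb") do |f|
    f.seek(offset)      # nothing is written for the skipped range
    f.write(payload)
  end
  st = File.stat(path)
  [st.size, st.blocks]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sparse.bin")
  size, blocks = make_sparse_file(path, 1024 * 1024, "data")
  puts "apparent size:  #{size} bytes"
  # Assumption: blocks are 512-byte units, as on Linux.
  puts "allocated size: #{blocks * 512} bytes" if blocks
  # Reading inside the hole returns zero-valued bytes.
  puts "first 4 bytes:  #{File.binread(path, 4).bytes.inspect}"
end
```

On a file system that supports holes, the allocated size should come out far smaller than the apparent size. After a plain copy that writes the zeros out, the two numbers would be close to each other.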
You mean the number at the beginning of the output line? I have no idea either; I never saw it before. It is not in a simple ls -l, so it must be the result of one of the many options you use. Did you try adding them one by one until you see this number appear?
BTW, I do not think the -1 option is useful here. It is useful with a plain ls, because that shows only file names and, without -1, arranges them in several columns if possible. I do not think the longer lines of ls -l need it.
Also, I am not sure why you use --size. Why not simply leave the size as it is, in bytes?
Another remark from me: rsync is designed to copy files. I doubt it guarantees that copying a tree of files results in something that is byte-for-byte equal to the original. It does so for the contents of the individual files. But directories may end up organised differently, I assume, and when the result is on a file system of a different type, that is almost certain IMO.
Maybe it is just about the terms you use. Sometimes it looks as if you use the terms “directory” and “disk” as synonyms, which they are not. E.g. when you talk about a 2 TB directory, which would be beyond any practical meaning (think of the many TB that all those millions of files administered within such a directory would take).
So, is this about a directory with all the files administered by it (the complete tree of directories and files starting from this directory), a file system (mounted at the directory you talk of), a disk partition, or a whole disk? And yes, some of these options may turn out to be (almost) the same. Nevertheless, it is better to be sure we all mean the same thing.
For me in this case, exact means identical, so that if I were to delete the source directory, I would not lose any data.
> Also, I am not sure why you use --size. Why not simply leave the size as it is, in bytes?
I’ll try this.
> It turns out that "rsync" does have a "--sparse" option.
I’ll try this too.
> Maybe it is just about the terms you use. Sometimes it looks as if you use the terms "directory" and "disk" as synonyms, which they are not. E.g. when you talk about a 2 TB directory, which would be beyond any practical meaning (think of the many TB that all those millions of files administered within such a directory would take).
Thank you for letting me know my word choice could be better.
By 2 TB directory, I mean a directory with 2 TB of data in it.
This is how my data is organized, everything under one directory. This makes the rsync script simple.
Well, rsync has been around for decades. I would say you can by now trust that it is able to do this basic job.
In all the years I have used rsync for all sorts of purposes (backup among them, of course), it never failed to copy files to a place and state from which they could be recovered.
But everyone to his own hobby, of course. I am only afraid that you will be kept busy by all sorts of side effects of the test you are doing (which in themselves may be interesting subjects, like “what is the meaning of that extra number in front of the output lines of ls in certain circumstances”).
BTW, quoting people’s text is done with the QUOTE tags: the button with “speaking cloud” just left of the CODE tags # button.
Yes, that’s what I assumed you meant. And I assume that you mean that in the recursive sense. That is, your count includes the data in subdirectories.
In a strict sense, a directory contains only names and related info (inode numbers). The data is in a file. I can put a 2 TB file into a directory, but the actual data in the directory is just the name and inode number of that file. It is, of course, convenient to talk as if the data in the file were part of the directory. But many directories are only 4096 bytes in length (as shown by “ls -l”). You would use the “du” command to get the total data, including what is in the files listed by the directory.
I mention this because it relates to some of your questions. If I have a directory with many files, then the size of the directory could be quite a bit larger than 4096 bytes. But maybe I then delete most of the files (or move them to a different directory). The size of the directory stays large, even if it contains only one or two file names – because most file systems are not actively shrinking directories to minimal size. If I now copy that directory to a new disk, the copied directory will be newly created with only the one or two file names. So the size of the copied directory will be much smaller. Does that count as an exact copy?
That last sentence is mostly a rhetorical question. It does not need an answer. But it helps explain why I asked what you mean by “exact copy”.
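The difference between the size of a directory itself and the data in the files below it can be sketched in Ruby as follows. This is only an illustration: the helper name tree_bytes and the sample file names are made up, and it sums apparent file sizes, whereas “du” by default reports allocated disk space, so the two can differ.

```ruby
require "find"
require "tmpdir"

# Sum the sizes of all regular files below root (symlinks are not followed).
def tree_bytes(root)
  total = 0
  Find.find(root) do |path|
    st = File.lstat(path)
    total += st.size if st.file?
  end
  total
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.txt"), "x" * 10_000)
  Dir.mkdir(File.join(dir, "sub"))
  File.write(File.join(dir, "sub", "b.txt"), "y" * 5_000)
  # Size of the directory's own list of names; depends on the file system.
  puts "directory itself:     #{File.stat(dir).size} bytes"
  # Data held in the files the directory tree lists.
  puts "file data underneath: #{tree_bytes(dir)} bytes"
end
```

The first number says nothing about how much data would be lost if the tree were deleted; only the second one does.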
The discussion is meandering away from the original post, but for me the discussion about rsync is very relevant.
In my experience, when copying a TB-sized source directory to an EMPTY destination directory, rsync fails every time. The exact reason is not clear to me, but I think it’s related to permissions combined with the rsync delete option to remove empty directories.
To get around rsync stopping, I always do a first pass with cp to get the files there, then I do the rsync pass using the md5 compare option.
The rsync --sparse option will be explored next.
One more side note. I read an article about a system administrator who had to move petabyte amounts of data, and about the associated copy verification effort.
The strategy used was to make two copies of the original directory on the new disk. Then a diff was done between the 3 directories. Only when the 3 diff files were identical or explained was the copy considered to have very high integrity, and only then were the original and the extra copy directory deleted. This is the backstory for interpreting the ls command. I am trying to duplicate this 3-directory copy concept.
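A three-directory comparison like the one described could be sketched in Ruby along these lines. This is not the method from the article, just one possible approach under assumptions: it compares SHA-256 digests of regular files only (no permissions, timestamps, or empty directories), and the names manifest and differences are hypothetical helpers.

```ruby
require "digest"
require "find"

# Map each regular file's path (relative to root) to its SHA-256 digest.
def manifest(root)
  result = {}
  Find.find(root) do |path|
    next unless File.lstat(path).file?
    rel = path.delete_prefix(root).sub(%r{\A/+}, "")
    result[rel] = Digest::SHA256.file(path).hexdigest
  end
  result
end

# Relative paths that are missing on one side or whose contents differ.
def differences(a, b)
  (a.keys | b.keys).reject { |k| a[k] == b[k] }.sort
end

# Usage (hypothetical script name): ruby compare3.rb ORIGINAL COPY1 COPY2
if __FILE__ == $PROGRAM_NAME && ARGV.size == 3
  orig, copy1, copy2 = ARGV.map { |dir| manifest(dir) }
  {
    "original vs copy 1" => differences(orig, copy1),
    "original vs copy 2" => differences(orig, copy2),
    "copy 1 vs copy 2"   => differences(copy1, copy2)
  }.each do |label, diff|
    puts "#{label}: #{diff.empty? ? 'identical' : diff.join(', ')}"
  end
end
```

Reading every file three times is slow on TB-sized trees, but unlike comparing ls output it is not affected by sparse files or by directory sizes that do not shrink.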
I very much appreciate openSUSE and have weaned myself off Windows, where I was using robocopy and teracopy.
I have the strong idea that the extra number before the lines has something to do with the --size option. (I already suggested that you do such a test, but you are too late now.) I assume the --size option does not replace the size in bytes that is already in the ls -l listing, but adds something.
Also, as I have indicated earlier, I doubt that using ls for the purpose you have is a sound idea. There are too many unknowns here, a few of which have already been pointed out (sparse files, directory sizes that do not shrink).
A few years ago I cooked up a simple little throw-away Ruby script to provide me with exact numbers and sizes of directory contents, comparable between all filesystems and operating-system platforms I’ve encountered.
The script (which I called »sum-it-up«) proved itself quite useful and »fungible«:
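```ruby
#!/usr/bin/ruby -w
# vim:ai:nu:et:sta:sts=4:sw=4

# Count folders, other non-plain entries, and the total bytes of plain
# files below the current directory (dot files are skipped by this glob).
sum, folders, other = 0, 0, 0
d = Dir["**/*"]
d.each do |f|
    if test(?f, f)          # plain file: add its size
        sum += File.size(f)
    else
        if test(?d, f)      # directory
            folders += 1
        else                # anything else, e.g. a broken symlink
            other += 1
            p f
        end
    end
end
puts
puts "%20d folders" % folders if folders > 0
puts "%20d #{other} non-plain files." % other if other > 0
puts "%20d bytes in #{d.size - folders - other} files." % sum
```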
Here is what it looks like:
```
rig:~ ▶ cd code/
rig:~/code ▶ sum-it-up    # invoke as "ruby scriptname" otherwise
"cmdcontrol/combine"
565 folders
1 1 non-plain files.
322593981 bytes in 6154 files.
rig:~/code ▶ _
```
The one non-plain file is a broken link to a »combine« command. All I’m usually concerned with is that the correct number of folders, files and the byte-exact amount of file data are present at two distinct locations.
Note: the script ignores »invisible files« (.directory, .DS_Store, dot-config files etc), which I prefer. One can quickly replace the line »d = Dir["**/*"]« with »d = Dir.glob("**/*", File::FNM_DOTMATCH)« to recursively match those »dot« files as well. Similarly, one could quickly build in any exceptions, path prunings and other stuff like checksumming contents of some or all files.
Maybe this can be of use to you guys as well. Cheers!