I want to download a repository for which there is no rsync server, and am looking at wget as the appropriate tool. I believe that the correct command is, e.g.,
According to the wget documentation, if I interrupt wget while it is downloading foo and then restart wget, it will leave a truncated copy of foo and download a new copy as foo.1; if I override this behavior with the -c option then it will attempt to continue the previous download, rather than fetch a new copy of foo, even if foo has changed in the meantime. Is there a way to cause wget to totally replace any partial file?
I’m not sure, but I seem to recall that “wget” downloads to a temporary name and only renames to the correct name after successful download.
You could also check the dates (timestamps). The default, as I recall, is to give the downloaded file the same timestamp as the original. But an incomplete download will always have the time that the download stopped. That might be enough to recognize and manually remove incompletely downloaded files.
I am not sure about that. I have not seen anywhere that the HTTP protocol sends a timestamp (which one?) of the original file on the HTTP server. And what should be sent if the response is the result of some server-side process like PHP and not a plain file?
I have to correct myself. There is a “Changed” time stamp in the HTTP protocol header that comes with the downloaded data.
But I am not sure what it is. It was probably clear in the days when pages were not dynamic. I just checked a page I have had open in a window for hours, but the change date is “now”. Probably because it contains a live video.
Nevertheless it is information that the OP may be able to use.
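To illustrate the timestamp idea from the posts above, here is a minimal sketch. It assumes the time stamp in question is the standard HTTP `Last-Modified` response header and that a completed download was given that time as its local modification time. The function name `looks_incomplete` and the one-second tolerance are hypothetical, and this is only a heuristic: a dynamic page may report "now" on every request.

```python
import os
from email.utils import parsedate_to_datetime


def looks_incomplete(path, last_modified_header, tolerance_seconds=1.0):
    """Return True if the local mtime differs from the server's Last-Modified value.

    Assumption: a finished download carries the server's timestamp, while an
    interrupted one carries the time at which the transfer stopped.
    """
    server_time = parsedate_to_datetime(last_modified_header).timestamp()
    local_time = os.path.getmtime(path)
    return abs(local_time - server_time) > tolerance_seconds
```

A file flagged this way is only a candidate for manual inspection or removal, not proof of a truncated transfer.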
According to the wget documentation, if I interrupt wget while it is downloading foo and then restart wget, it will leave a truncated copy of foo and download a new copy as foo.1
Read the documentation once again more carefully. It is not what happens with your (fixed) command line (although the end result is wrong):
Reusing existing connection to repos.arcanoae.com:443.
HTTP request sent, awaiting response... 304 Not Modified
File 'anpm/rpm-yum-base-os2-i686-2015-12-22.zip' not modified on server. Omitting download.
--2019-08-10 09:28:50-- https://repos.arcanoae.com/anpm/rpm-yum-base-os2-i686-2016-08-25.exe
Reusing existing connection to repos.arcanoae.com:443.
HTTP request sent, awaiting response... 304 Not Modified
File 'anpm/rpm-yum-base-os2-i686-2016-08-25.exe' not modified on server. Omitting download.
--2019-08-10 09:28:50-- https://repos.arcanoae.com/anpm/rpm-yum-base-os2-i686-2016-08-25.zip
Reusing existing connection to repos.arcanoae.com:443.
HTTP request sent, awaiting response... 304 Not Modified
File 'anpm/rpm-yum-base-os2-i686-2016-08-25.zip' not modified on server. Omitting download.
--2019-08-10 09:28:50-- https://repos.arcanoae.com/anpm/rpm-yum-base-os2-i686-2016-12-25.exe
Reusing existing connection to repos.arcanoae.com:443.
HTTP request sent, awaiting response... ^C
bor@bor-Latitude-E5450:/tmp/wget$ LC_ALL=C ll anpm/rpm-yum-base-os2-i686-201*
-rw-r--r-- 1 bor bor 36 Jan 3 2017 anpm/rpm-yum-base-os2-i686-2015-12-22.zip
-rw-r--r-- 1 bor bor 2952000 Aug 10 09:24 anpm/rpm-yum-base-os2-i686-2016-08-25.exe
-rw-r--r-- 1 bor bor 4472000 Aug 10 09:25 anpm/rpm-yum-base-os2-i686-2016-08-25.zip
bor@bor-Latitude-E5450:/tmp/wget$
Here both anpm/rpm-yum-base-os2-i686-2016-08-25.exe and anpm/rpm-yum-base-os2-i686-2016-08-25.zip are incomplete. No second file is created.
Is there a way to cause wget to totally replace any partial file?
Define “partial file”. There is no way to detect a partial download using only file size and timestamp. You either need a protocol that compares file content (rsync, torrent), or you need to keep additional metadata about the files (for example, fetch the headers and store them before downloading the file, then compare the result and delete partial downloads).
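The "keep additional metadata" approach could look roughly like the sketch below. It assumes the response headers are available as a plain dict and that the server reports a `Content-Length`, which dynamic responses may not. The `.meta` sidecar suffix and both function names are made up for illustration.

```python
import json
import os


def save_expected(path, headers):
    """Store the server-reported size and date next to the target, before downloading."""
    meta = {
        "content_length": int(headers["Content-Length"]),
        "last_modified": headers.get("Last-Modified"),
    }
    with open(path + ".meta", "w") as fh:
        json.dump(meta, fh)


def remove_if_partial(path):
    """Delete the download if its size disagrees with the recorded Content-Length."""
    with open(path + ".meta") as fh:
        meta = json.load(fh)
    if os.path.getsize(path) != meta["content_length"]:
        os.remove(path)
        return True
    return False
```

A matching size still says nothing about the content, so this catches truncated files only, not files that were replaced on the server.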
You are apparently attempting to mirror some package repository. Normally, when you get a new package version, it also has a different name. Is it really possible that this repository silently replaces a package with one of the same name but different content?
Does this repository provide metadata about its content (like checksums)? Then you can verify content and delete any partial file (which will fail checksum verification).
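If the repository does publish checksums, verification could be sketched as follows. This assumes a manifest in the `sha256sum` text format (one `<hex digest>  <file name>` pair per line); whether this repository offers such a file, and under what name, is not known from the thread. The function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large packages need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_bad_files(manifest_path, base_dir):
    """Delete files whose SHA-256 differs from the manifest; return their names."""
    removed = []
    with open(manifest_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            # sha256sum marks binary mode with a leading '*' before the name.
            name = name.strip().lstrip("*")
            target = os.path.join(base_dir, name)
            if os.path.exists(target) and sha256_of(target) != expected.lower():
                os.remove(target)
                removed.append(name)
    return removed
```

Running the mirror again afterwards would then fetch fresh copies of whatever was removed.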
Regarding “continuing” an interrupted download, the HTTP protocol is fundamentally unable to verify downloads by itself, and this is no different whether you use wget or any web browser.
This is why the HTTP protocol is fine for transferring relatively small files, from kilobytes to a few megabytes, but highly risky when transferring extremely large files. For those you generally want to transfer in a single, uninterrupted session so that TCP’s packet sequencing ensures your download should be complete… but notice it’s “should” and not “guaranteed”, because other protocols like torrent additionally verify the chunks and the whole file by hashes.
So,
The general rule of thumb for resuming interrupted http downloads should be…
Don’t.
The exception would be if the application doing the download supplements the HTTP protocol with extra features, which is why people use download managers like Filezilla when they must use HTTP or a similar protocol that can’t guarantee file integrity.
[quote="arvidjaar,post:5,topic:137474"]
Read the documentation once again more carefully. It is not what happens with your (fixed) command line
[/quote]
The documentation may be incorrect, but it clearly says
"If there is a file named ls-lR.Z in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.
Note that you don’t need to specify this option if you just want the current invocation of Wget to retry downloading a file should the connection be lost midway through. This is the default behavior. ‘-c’ only affects resumption of downloads started prior to this invocation of Wget, and whose local files are still sitting around. Without ‘-c’, the previous example would just download the remote file to ls-lR.Z.1, leaving the truncated ls-lR.Z file alone."
I don’t see how I could have misinterpreted that.
A target file that doesn’t have all of the content of the source file.
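As an illustration of the ‘-c’ passage quoted from the manual above: over HTTP, a request to continue "from an offset equal to the length of the local file" is normally expressed with a `Range` request header. The sketch below only builds that header. The function name is hypothetical and this is not Wget's actual code.

```python
import os


def resume_headers(local_path):
    """Build request headers asking for the bytes after what is already on disk."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    if offset == 0:
        # Nothing on disk yet: request the whole file.
        return {}
    return {"Range": f"bytes={offset}-"}
```

Nothing here checks that the bytes already on disk still match the remote file, which is exactly the risk raised in the original question.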
It clearly says: “When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named ‘file.1’.”
A target file that doesn’t have all of the content of the source file.
You conveniently trimmed my explanation that it is impossible to detect using file properties only.
You conveniently quoted a section of the manual totally different from the one that I cited. The section on -c does not have a link to the section on -nc. Had you wanted to be helpful then you would have said that the documentation was inconsistent.
You asked a question and I answered it. I did not quote text that was irrelevant and incorrect to boot; the -N option wouldn’t work if wget didn’t have access to the timestamp.