wget website copying

I decided to give wget a spin after this thread from a year ago.

I am very impressed with the current version of wget: it is lightning fast compared to a year ago, and it made flawless local copies of several websites.

However, for this website (command below) it didn’t copy any child pages, and I am at a loss as to why. I am hoping someone with wget experience can explain why this happens on this specific website.

Here is the problem command:

wget --recursive --domains uima.apache.org --no-parent --page-requisites --html-extension --convert-links https://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/overview_and_setup/overview_and_setup.html#ugr.ovv.eclipse_setup.install_eclipse

You’ll notice that only the specified root page is copied locally. If you hover over a link in the copied page, it looks as if it first tries to reach a local copy and, when that fails, offers a link to the original page.

AFAIK the wget options I’ve selected should download a “full website”, following all links on the specified page and restricted only to pages within the “uima.apache.org” domain.


As with many other issues, I found my own mistake a little while after posting…

In this case, I was pointing wget at a file rather than at the root directory of the pages I wanted copied.

So, typically, when you first browse a website and find content you want copied locally:
Copy the URL from the browser, then remove the specific page reference, leaving only the directory that holds that page.
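
For anyone who wants to script that step, here is a rough sketch using the URL from my original command. The variable names (page_url, no_fragment, dir_url) and the mirror_dir helper are made up for illustration, and the wget options are simply the ones I used above; I am only assuming the same options behave as intended once they are pointed at the directory.

```shell
# URL as copied from the browser (page name plus "#fragment").
page_url='https://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/overview_and_setup/overview_and_setup.html#ugr.ovv.eclipse_setup.install_eclipse'

# Drop the "#fragment", then drop everything after the last "/" (the page name).
no_fragment="${page_url%%#*}"
dir_url="${no_fragment%/*}/"

# Show the directory URL that wget will be pointed at.
echo "$dir_url"

# Hypothetical helper: same options as the original command, aimed at a directory URL.
mirror_dir() {
    wget --recursive --domains uima.apache.org --no-parent \
         --page-requisites --html-extension --convert-links "$1"
}

# Uncomment to start the download (needs network access):
# mirror_dir "$dir_url"
```

The download itself is left commented out so the snippet can be run safely just to check which URL it produces.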


You solved it already.

My suggestion would have been to check whether the website has a robots file that wget is honouring. From the man page:

Wget respects the Robot Exclusion Standard (/robots.txt).

That does not seem to be the case here, but it is worth remembering when there is a problem on another website. If I remember correctly, wget says so in its output.
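
If you want to look at the robots file yourself, something like the following should work. It is only a sketch: the variable names are my own, and it assumes the file sits at the usual /robots.txt location on the same host as the page.

```shell
# Page URL from the original command (fragment left off).
page_url='https://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/overview_and_setup/overview_and_setup.html'

# Split off scheme and host with plain parameter expansion.
scheme="${page_url%%://*}"   # text before "://"
rest="${page_url#*://}"      # host plus path
host="${rest%%/*}"           # text before the first "/"

robots_url="${scheme}://${host}/robots.txt"
echo "$robots_url"

# Uncomment to print the site's rules to the terminal (needs network access):
# wget --quiet --output-document=- "$robots_url"
```

Any Disallow lines in that file that match the paths you are trying to copy would be a likely reason for wget skipping them.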