
Thread: wget website copying

  1. #1
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    10,927
    Blog Entries
    2

    Default wget website copying

    Decided to give wget a spin after this thread a year ago
    http://forums.opensuse.org/opensusef...-web-site.html

    I'm very impressed with the current version of wget; it is lightning fast compared to a year ago, and on several websites it made a local copy flawlessly.

    But this website (command below) didn't copy any child pages, and I'm at a loss why. I'm hoping someone with wget experience might understand why this happened on this specific website.

    Here is the problem command

    Code:
    wget --recursive --domains uima.apache.org --no-parent --page-requisites --html-extension --convert-links https://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/overview_and_setup/overview_and_setup.html#ugr.ovv.eclipse_setup.install_eclipse
    You'll notice that only the specified root page is copied locally. If you hover over a link in the copied page, it appears to first try the local copy and, when that fails, fall back to a link to the original page.

    AFAIK the wget options I've selected should properly download a "full website", following all links on the specified page, restricted only to pages within the "uima.apache.org" domain.

    TIA,
    TSU

  2. #2
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    10,927
    Blog Entries
    2

    Default Re: wget website copying

    <sigh>
    As with many other issues, I found my own mistake a little while after posting...

    In this case, I was pointing wget at a file rather than the root <directory> of the pages I wanted copied.

    So, typically, when you first browse a website and find content you want copied locally...
    Copy the URL from the browser, then <remove the specific page reference>, leaving only the directory that holds the page. See the example command below.
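    For example, the command from my first post would become something like this (just a sketch; pick the directory level that covers everything you want to mirror):

    Code:
    wget --recursive --domains uima.apache.org --no-parent --page-requisites --html-extension --convert-links https://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/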

    TSU

  3. #3
    Join Date
    Jun 2008
    Location
    Netherlands
    Posts
    24,863

    Default Re: wget website copying

    You solved it already.

    My suggestion would have been: check whether there is a robots file on the website that is being honoured. From the man page:
    Wget respects the Robot Exclusion Standard (/robots.txt).
    That does not seem to be the case here, but you could keep it in mind when there is a problem on another web site. If I remember correctly, wget says so in its output.
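    If that had been the cause, one possible workaround (a sketch only, and assuming the site's terms allow it) is to tell wget to ignore the robots file via a .wgetrc-style command:

    Code:
    # -e passes a .wgetrc command on the command line; "robots = off" disables robots.txt handling
    wget --recursive --no-parent -e robots=off --page-requisites --convert-links https://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/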
    Henk van Velden
