Local copy of web site

Hi

Does anyone know of an application for making copies of web sites that can be read offline? I’ve tried using wget, but with very mixed results. Something a bit more reliable would be useful.

Thanks

David

Hi
Firefox should be able to do that for offline reading. Under Preferences > Advanced > Network tab you can configure the cache and offline website data.

I remember something called wwwoffle. Fortunately I never had to use it.

Some years ago, I heard or read a talk about some people in South Africa sending cached web content to remote schools on Compact Flash cards by the daily “milk run”. That was such a valiant effort. Hopefully they don’t have to do much of that nowadays.

Hi there,

I’ve actually had pretty good success with wget, and it offers a fair number of options so you can tune it how you like. Perhaps it’s just a matter of getting the syntax right. If you want to browse the mirror you make, then you have to convert the links, which wget will do for you. You can also specify the maximum number of directories to descend into, an interval between requests (so as not to hit the site too hard), and so on. What syntax did you try with wget? Maybe we can just tune it so it works well for you.
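For example, something along these lines (the URL and the numbers are just placeholders, adjust to taste) mirrors a site three levels deep, converts the links for local viewing and waits a second between requests:

# placeholder URL; tune --level and --wait for the site in question
wget --recursive --level=3 --convert-links --wait=1 http://www.example.com/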

I’m assuming you don’t have FTP access and just need to pull the content via HTTP. If you do have FTP access, a pretty handy tool I’ve used is Mirror FTP Tool - Lyceum

Cheers,
LewsTherin

It depends on how the HTML is written. There should not be many problems if relative paths are used and the content is static. But if the content is generated on the fly by PHP or other scripts, portions or all of the data may come from a database or multimedia server, and it may not be possible to get a locally running, HTML-only setup.

On 02/05/2011 12:06 AM, billingd wrote:
> I’ve tried using wget, but with very mixed results.

i’ve not done it with wget in years but when i did it was sometimes
a good bit of work to figure out how to get the command line switches
*just right* for any particular site/task, but once done it was just
let’er rip…done, every time perfect, until something changed…

but, i must admit that that was before sites were mostly just a series
of static pages…now they are so very complex, with css and (for
example) database-driven pages built “on the fly” by scripts and black
magic…sometimes with double magic bouncing data in from several
different locales all at the same time…and scripts galore (not to
mention php and ruby in the sky with diamonds! ;-) )

otoh, i’d be surprised if wget has not kept up with the advances in web
site trickery…but if it has, setting it up must take some thought
and skill…and luck.


DenverD
CAVEAT: http://is.gd/bpoMD
[NNTP posted w/openSUSE 11.3, KDE4.5.5, Thunderbird3.0.11, nVidia
173.14.28 3D, Athlon 64 3000+]
“It is far easier to read, understand and follow the instructions than
to undo the problems caused by not.” DD 23 Jan 11


If nothing else, Firefox can do this with right-click > Save Page As. Open
the page from your Desktop (or wherever you saved it).

I’ve also had family use this Firefox Add-On for longer-term saving and
categorizing, though it wasn’t necessarily made for this purpose:

http://amb.vis.ne.jp/mozilla/scrapbook/

Good luck.

On 02/04/2011 06:36 PM, gogalthorp wrote:
>
> It depends on how the HTML is written. There should not be many problems
> if relative paths are used and the content is static. But if the content
> is generated on the fly by PHP or other scripts, portions or all of the
> data may come from a database or multimedia server, and it may not be
> possible to get a locally running, HTML-only setup.
>
>

I used HTTrack for this purpose. The package httrack is in the OSS repo for openSUSE 11.2 (and most probably also for 11.3).
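With the repo enabled, installing it should just be a matter of something like:

# run as root; package name as it appears in the repo
zypper install httrack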

httrack is in the packman repo for 11.3.

For 11.2 it is also in Packman :shame: :shame: and not in OSS.
Sorry for the confusion.

I also use Webhttrack from the 11.3 OSS to snake websites (copy entire websites, re-writing links for local browsing).

It’s got its drawbacks:

  • Doesn’t understand a lot of formatting syntax
  • Doesn’t have a simple URL blacklist/whitelist to exclude specific domains and paths (though see the command-line sketch below)
  • The word database compiler hasn’t worked for me yet
  • It’s slow. Obviously the base code could either use a re-write or be ported to a compiled codebase.

But, it does get all the text content fine and allows browsing offline.
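For what it’s worth, the command-line httrack does take +/- scan rules, which may cover at least part of the blacklist/whitelist case. If I remember the syntax right, something along these lines (URL, output path and the excluded domain are placeholders) mirrors a site while skipping PDFs and one unwanted domain:

# +pattern includes, -pattern excludes; paths and domains are hypothetical
httrack "http://www.example.com/" -O ~/mirrors/example "+*.example.com/*" "-*.pdf" "-*ads.example.net/*"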

Tony

Years ago I used to use ‘sitecopy’, but imho it’s a pain to set up. More recently I have simply used ‘wget’, as others have noted.

Since wget only copies the website, the copy isn’t browsable (unless others know something I don’t), because the original links will still point to their original locations.

I suppose it <might> be possible to deploy the copied website on a local webserver, then modify the hosts file to point the website’s domain to the local IP address serving the copy…
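Roughly, that would mean pointing a local web server’s document root at the copied site and then resolving the original hostname locally; a minimal sketch (the domain is hypothetical, run as root):

# send requests for the site's hostname to this machine
echo "127.0.0.1   www.example.com" >> /etc/hosts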

Tony

It depends on how the links are set up. Most sites that I copied had internal links relative to directories and subdirectories on the site where they were hosted, and not hardcoded to the specific site.

Yes, I’d agree.
Problem, I guess, is that for the websites I’m often interested in, the site is often built modularly, with different parts served from different machines… In a higher-capacity deployment it’s common for images to be served from a different machine, and larger websites (e.g. typical news sites) will usually aggregate content from multiple domain sources even within the same business.

This type of website can’t be simply copied without modifying links if it’s to be browsed offline.

Tony

Thanks everyone :)
Looks like some useful ideas there, especially the Firefox ScrapBook add-on and httrack, which I certainly intend to check out. I was trying to make a local copy of a forum for browsing on a train journey. Had to resort to reading a book :). The main problem I was having with wget was that links were not available offline. To give you an example of my usage:

wget -r -l3 --convert-links www.website.com

Once again, many thanks for the great responses

David

On 02/05/2011 05:36 PM, tsu2 wrote:
> This type of website can’t be simply copied without modifying links if
> it’s to be browsed offline.

i’ve not looked at the man in a long time, but i believe wget will
change a page’s internal addressing, and what you wind up with on your
local drive will be browseable…

however, you must read the man and use the correct switches.

ok, i opened the man and see this: “Wget can follow links in HTML and
XHTML pages and create local versions of remote web sites, fully
recreating the directory structure of the original site. This is
sometimes referred to as “recursive downloading”. While doing that,
Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be
instructed to convert the links in downloaded HTML files to the local
files for offline viewing.”

and

“--convert-links
After the download is complete, convert the links in the document to
make them suitable for local viewing. This affects <snip>”
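so, putting those two bits together, something along these lines (the url is just a placeholder) ought to leave a browseable copy on the local drive:

# -r recurse, -l depth, --convert-links re-writes links, --page-requisites grabs images/css
wget -r -l 3 --convert-links --page-requisites http://www.example.com/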


DenverD

Intrigued by this thread, I spent a couple days testing whether wget could replace webhttrack in my toolbox.

Sadly, not.

First, this is the wget command I used for testing, written with long-form options so they are easier to follow. The target website “localnewssite” stands in for a few sites I tested, all with similar results. The text files “excludedomains” and “excludefiles” were created in an attempt to blacklist specified domain names and file types (e.g. pdf).
Briefly, the options are supposed to do the following:

  • Level is set to the target plus 2 child levels
  • If the target is not the site’s homepage, retrieve only the target and its child pages, never ascending to the parent
  • Save all pages with an .html extension so they are recognized as client-side content
  • Re-write links for offline browsing
  • Retrieve content from other (remote) hosts the pages reference, not just the target host
  • Download everything necessary to rebuild a client-side-only page

wget -r --level=3 -e robots=off --exclude-domains excludedomains --reject excludefiles  --no-parent --html-extension --convert-links --span-hosts --page-requisites http://www.localnewssite.com
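(One caveat I noticed afterwards: as far as I can tell, --exclude-domains and --reject expect comma-separated lists rather than filenames, so feeding them the two text files would need something like the sketch below, with “paste” joining the lines with commas.)

wget -r --level=3 -e robots=off --exclude-domains "$(paste -sd, excludedomains)" --reject "$(paste -sd, excludefiles)" --no-parent --html-extension --convert-links --span-hosts --page-requisites http://www.localnewssite.com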

Results:
wget is single-threaded and uses only a single network connection. At first I also used the “--wait” switch to pause between retrievals and avoid hammering the server, but in the end I decided the download would take too long, and on a very big site the “hammering” probably isn’t a terribly unusual load.

It appears to download page elements one by one rather than an entire page with all its parts at once, which can result in incomplete pages if the overall download is interrupted. Compare this to a regular web browser, which will typically download a page and all its elements simultaneously over separate network connections.

It re-writes links only after the download is complete, which means the re-write may never happen if the download is interrupted, and it can be very slow.

Re-writing links to remote domains is faulty. Wget creates a directory representing the remote domain, but then, following the link’s original syntax, it points only to that directory, without understanding that on a real web server a default document would be served. The result is a broken link that points to the directory rather than to a file within it.

Although I have problems with webhttrack, at least it supports multiple network connections and usually builds offline browsing links correctly.

IMO,
Tony

On 02/12/2011 08:06 PM, tsu2 wrote:
> Although I have problems with webhttrack, at least it supports multiple
> network connections and usually builds offline browsing links
> correctly.

my Rule One: Use what works for you.

i hear that MS runs fine in a VM…

note: i did mention that wget is OLD software and it worked great
before web sites got so complicated with on-the-fly page generation
etc etc etc…


DenverD