Using wget as an offline browser to download all mp3 files from a website.

I have now interrupted the running wget (at 19:09).

I have downloaded:

henk@boven:~/test/rupesh> du -sh www.pravachanam.com/
701M    www.pravachanam.com/
henk@boven:~/test/rupesh>

There are

henk@boven:~/test/rupesh> find  . -name '*.mp3' | wc -l
23
henk@boven:~/test/rupesh>

Some of them:

henk@boven:~/test/rupesh> find  . -name '*.mp3'
./www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sanathana Dharmam/Thirunagari Lakshmana swamy/Subhashitaamrutham/05_paropi_hitavaan_bandhuhu.mp3
./www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sanathana Dharmam/Thirunagari Lakshmana swamy/Subhashitaamrutham/03_na_prahrushati_sanmaaney.mp3
./www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sanathana Dharmam/Thirunagari Lakshmana swamy/Subhashitaamrutham/07_sareerasya_gunaanaam.mp3
./www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sanathana Dharmam/Thirunagari Lakshmana swamy/Subhashitaamrutham/08 kim kulena vishalena.mp3
./www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sanathana Dharmam/Thirunagari Lakshmana swamy/Subhashitaamrutham/subhashitha_upodhgatham.mp3
.
.
.
.

What wget says:

.
.
.
--2017-10-11 19:08:44--  http://www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sriramanuja%20Scriptures/Ushadripathi%20Swamy/SriRanga%20Gadya%20Vaibhavam%20Ush%202017/Day%2007%20SriRanga%20Gadya%20Ush%20Pravachanam.mp3
Reusing existing connection to www.pravachanam.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 20062352 (19M) [audio/mpeg]
Saving to: ‘www.pravachanam.com/sites/default/files/pravachanams/Telugu/Sriramanuja Scriptures/Ushadripathi Swamy/SriRanga Gadya Vaibhavam Ush 2017/Day 07 SriRanga Gadya Ush Pravachanam.mp3’
.
.
.

So, what is the problem?

Something I only noticed just now:

linux-ps66:~ # wget --convert-links -c -t 0  -v --recursive --force-directories --no-clobber --accept mp3 --directory-prefix=/root/temp http://www.pravachanam.com/categorybrowselist/20

You are running this as root and placing the results in root's home directory.

If there is one thing that can make other Linux users furious, it is **doing things as root that should not be done as root!**

The easiest way to destroy your system is to run things as root that should not be run as root.

You should stop this immediately!

Run all of this as your normal user (probably rupesh). Otherwise almost nobody will be willing to help you anymore.
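For example, something like this (just a sketch; the directory under the user's home is an arbitrary choice):

mkdir -p ~/temp
# same command as before, but run as the normal user and writing below that user's own home directory
wget --convert-links -c -t 0 -v --recursive --force-directories --no-clobber --accept mp3 --directory-prefix="$HOME/temp" http://www.pravachanam.com/categorybrowselist/20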

The problem is to download only the mp3 files.

OK, and do you know the answer? As you pointed out, --accept mp3 does not work the way one would think at first sight.

You could of course download the whole web site and then delete everything that is not an mp3 (with a find), but to me that is a rather artificial solution and probably not one fit for a slow internet connection.
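Just to illustrate that brute-force variant (a sketch only; it still transfers the whole site first, and the start URL is the one from above):

# mirror the complete site into the current directory
wget --recursive --force-directories --no-clobber http://www.pravachanam.com/categorybrowselist/20
# then throw away every file that is not an mp3 ...
find www.pravachanam.com -type f ! -name '*.mp3' -delete
# ... and remove directories that are left empty
find www.pravachanam.com -type d -empty -delete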

Strange thing is that --accept .mp3 or -A .mp3 is found in “how-tos” all over the internet. Are all these people silly and just say things without ever having tested them?

Of course I tested it myself and the transfers are indeed stopped after the first few files.

No, I never needed this. But the problem is not trivial - you cannot blindly follow all links, as that means downloading exactly everything, so a tool should probably first issue a HEAD request and check whether the MIME type is text/html before loading and parsing a document.
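A rough sketch of that idea with curl, outside of wget (the URL is only a placeholder; wget itself does not work this way):

url='http://www.example.com/some/link'   # placeholder for a URL found on a page
type=$(curl -sI "$url" | tr -d '\r' | awk -F': ' 'tolower($1)=="content-type" {print $2}')
case "$type" in
  text/html*)  echo "HTML: fetch it and parse it for further links" ;;
  audio/mpeg*) echo "mp3: fetch it and keep it" ;;
  *)           echo "something else: skip it" ;;
esac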

Well, it’s not that it never works. You can gather all the files on a single page, and even do it recursively, as long as all links point to *.html pages.

But yes, this is the main problem - most instructions you see on the Internet do not mention the exact conditions, so unless you follow them in exactly the same environment there is no guarantee they will work. And because these instructions do not explain the background or the boundary conditions (often because the authors obviously do not know them themselves), you are left out in the cold.

Thanks for the explanations. It is getting a bit clearer to me.

Maybe I can try to express it in a different way, because the OP tries to mirror the directory structure of the original web site, but with only the files whose names end in .mp3.

The directory structure on the web site itself is NOT the same as the tree of web pages (starting at its “home” page). In fact it can’t be, because the pages can link to any other page irrespective of their place in the directory tree. And that is why it is a “web” in the first place.
The fact that the web pages on a web server are organised in a directory tree (however flat or deep it may be) is just how the web master finds it easiest to maintain the web site. It can mirror (part of) how the observer sees the connection between the pages, but often it will not.

E.g., web masters will often gather images in one place/directory irrespective of where in the web site they are referred to.

Thus “recursive”, as used in the -r option of wget, has no meaning with respect to any directory tree (as it has in e.g. chmod -R); it refers to how many layers of links you follow (where you can easily land on the same web page again and again, and there are several options in wget for what to do when the same page is encountered again).
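A small illustration: the depth here counts links followed from the start page, not directory levels (the value 2 is just an example):

# follow links at most 2 hops away from the start page,
# regardless of how deep or flat the files sit in the server's directory tree
wget --recursive --level=2 http://www.pravachanam.com/categorybrowselist/20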

Thus when a “home” page has links to other pages whose names end in .mp3, they will be detected as such, but other files with names ending in .mpr in the same directory will not be found when nothing links to them.


web-root-directory--|
                    | home.html (with links to mus1.mp3 and mus3.mp3)
                    | mus1.mp3 (will be found by wget)
                    | mus2.mp3 (will not be found by wget, and also not "seen" as a link by the end-user)
                    | mus3.mp3 (will be found by wget)

OTOH when that home page links to an HTML page mp3list.html that says "this is the list of MP3 files", and that page then has those links, the *.mp3 files will not be found at all, because mp3list.html, not being an .mp3 file, will be skipped and not interpreted by wget.


web-root-directory--|
                    | home.html (with link to mp3list.html)
                    | mp3list.html (with links to mus1.mp3, mus2.mp3 and mus3.mp3; will be skipped by wget, no .mp3)
                    | mus1.mp3 (will not be found by wget, no link to it seen by wget)
                    | mus2.mp3 (will not be found by wget, .... )
                    | mus3.mp3 (will not be found by wget, .....)

I hope I have this correct, and I also hope it helps in understanding.

Did you mean “mus2.mpr”?

mp3list.html, not being an .mp3 file, will be skipped and not interpreted by wget.

No, *.html is implicitly followed by wget, even in the presence of other filters. But in the OP's case the links do not have any extension at all.

Sorry, I have some “mpr” where I wanted to type “mp3”. Silly crossed fingers or some other senior problem.

You mean that there are directory names at the end of the URLs, not file names? And the web server then falls back to e.g. index.html (or whatever else is configured), but wget acts on the original URL?

It is indeed not very clear what

www.pravachanam.com/categorybrowselist/20

is. I assume a directory.

But it downloads

www.pravachanam.com/index.html

which it then removes because it is not an mp3.

wget may look in that file, and I did so myself. It contains no URLs ending in .mp3, nor any ending in .html. I guess the exercise stops there.

The page is full of script references, but wget can of course not guess whether those scripts will ever result in a URL ending in .html or .mp3.
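For anybody who wants to check such a page themselves, a quick sketch (fetching the page once by hand into a scratch file, so wget's own cleanup does not remove it again):

wget -O page.html http://www.pravachanam.com/categorybrowselist/20
# list all href targets found in the page
grep -Eo 'href="[^"]*"' page.html | sort -u
# count how many times "mp3" occurs anywhere in the page at all
grep -oi 'mp3' page.html | wc -l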

All that server-side and client-side scripting blocks, IMHO, what the OP wants to do.

I do not understand that, sorry.

And the web server then falls back to e.g. index.html (or whatever else is configured), but wget acts on the original URL?

It does not matter what the web server does. wget gets the original page and scans it for further links (“a” tags or whatever). It takes the URLs on the page literally. It has no idea at that moment whether they resolve to another HTML page, to a foo.zip file, or to song.mp3. As I already mentioned, this could be handled by first probing the URL with a HEAD request and then deciding on the basis of the returned information. wget does not do that.

But it downloads

www.pravachanam.com/index.html

which it then removes because it is not an mp3.

wget may look in that file, and I did so myself. It contains no URLs ending in .mp3, nor any ending in .html. I guess the exercise stops there.

Correct.

Thanks for bearing with me. It is all the redirecting and scripting that is the killer here.

E.g. when the link is to a PHP script (thus ending in .php), wget cannot “know” that that PHP script on the server will in the end deliver an .mp3 file to the client. The client can see it by looking at the MIME type in the HTTP header, but wget doesn’t.

And there are far more of these obfuscations.

If the User knows the website directory structure, he can specify only the directory holding the mp3 files.
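A sketch of what that could look like, assuming (and that is a big assumption here) that the directory holding the mp3 files, e.g. the /sites/default/files/pravachanams/ path seen above, is directly browsable:

# stay inside the given directory subtree and do not climb up to parent directories
wget --recursive --no-parent --accept mp3 "http://www.pravachanam.com/sites/default/files/pravachanams/"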

Although the original request might ask for the entire website directory tree, the question then is “Why?”, if one is interested only in the mp3 files.

TSU

Bandwidth of the connection? Disk space?

If he knows it (but I doubt this user does), it might work in the simple case. But, as shown above, the server could serve the MP3 files through e.g. PHP scripts, and the directory where they are stored might be blocked for direct client access (or even be outside the server root).

You click on http://www.foo.bar/getmusic.php?tune=hurray.mp3 and you get back an HTTP header which says

Content-Type: audio/mpeg

the MP3 data follows, and you will never know where it is stored on the server, not even whether it is within the tree that is used as “server root” in the server configuration.
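To make that visible (the URL is the made-up example from above, so the reply shown is hypothetical):

# ask only for the response headers of the (hypothetical) script URL
curl -sI 'http://www.foo.bar/getmusic.php?tune=hurray.mp3'
# interesting lines in such a reply would be e.g.:
#   Content-Type: audio/mpeg
#   Content-Length: <size of the mp3 in bytes>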

Actually, if bandwidth is not a concern, it should not be hard to patch wget to ignore the filters during page content parsing and only apply them when deciding whether to keep a downloaded document.

I doubt if this is ever going to work.

I visited the web site and browsed down to a page where you can download some of those MP3 files: http://www.pravachanam.com/albumfilesbrowselist/15/20

I looked into the code of the page. The download links look like this one:

<a href="/file/11448/download?token=TU9_y7kh" type="audio/mpeg; length=23436197">Download</a>

I doubt that wget will ever be able to handle that.

Probably.
Without inspecting the page source, I suspect your link is embedded in a code block that calls a function that uses that link.
Or, that “download” in the URL could be an actual executable (boy, if that’s so, I’d feel very uncomfortable, as it could be a security hole waiting to be exploited).

If the function is publicly documented, then that is the way to go.

TSU

That download certainly is a server side executable, and it gets a parameter.

When you feel uncomfortable, you could try to warn them. In any case, the security of that web site is not the subject of this thread.

OK
Just for anyone’s benefit, I generally don’t think it’s a good idea to expose a binary directly to remote Users.
Better is to hide the binary and call it from a page.
Even better is to not use a binary at all, but a class library that is part of a standard website framework, because then there is a proven and tested way to serve files that is not likely to have security issues.

TSU