Hi, I am Rupesh from India. I have examined a website that contains about 50,000 mp3 files, of which I want 11,000. I have already downloaded 8,000 mp3 files, so I still need to download up to 3,000 more files from the same website and discard the rest.
I downloaded the files from the net using an offline browser called Extreme Picture Finder. That application has an option called "skip if the destination file exists", and I am going to re-download the files with that option selected. The application also has options for scanning the website, spidering, etc., all of which I have understood.
Previously, after downloading files with the offline browser, I copied them to another directory; the original directory structure was lost, but the files themselves are still present in other directories.
The 11,000 files I want to download total about 135 GB, of which I have downloaded 93 GB so far. If I could obtain the website's directory structure together with the files as empty placeholders (filenames only, no data in them), I could keep the directory names and file names exactly the same as on the website.
At present I have openSUSE Leap 42.2 installed on my system. By opening a terminal emulator and issuing the command ls -R > filenames.txt I can obtain a list of filenames together with their directory names. Is there any command or tool that can obtain just the filenames and directory names of a directory on a website and store the output in a text file?
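I think a command like the following would also give such a list of my local copy, one path per line (this is only my local listing, not the website):

    # list every file under the current directory, with its directory path
    find . -type f > filenames.txt
    # list the directories themselves
    find . -type d > dirnames.txt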
So please suggest a way to obtain the list of directory names, and the filenames contained in those directories, and store it in a text file. If possible, please also suggest how to keep the directory structure and filenames the same as on the website, but with the files left empty.
You could try website mirroring software. Look e.g. at wget. And judging by the description of pavuk, that could also be a solution.
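For instance (a rough sketch, untested here, with http://example.org/audio/ only standing in for the real site), wget can crawl in spider mode so that it logs the URLs it finds without saving the files themselves:

    # crawl recursively but do not store the files; write the log to spider.log
    wget --spider -r -l inf --no-parent -o spider.log http://example.org/audio/

    # pull the mp3 URLs out of the log into a plain text list
    grep -oE 'https?://[^ ]+\.mp3' spider.log | sort -u > filenames.txt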
I remember that I once used https://www.httrack.com/
In the meantime, please understand the following.
Most of the time a webmaster has an index.html (or similar) in a directory and that will be offered to the client when the client asks for the directory. Thus you will not see what is in the directory.
When there is no index.html (or similar) and the web server has “directory listing” switched on, you get something like what you see at this link: https://ftp.gwdg.de/pub/opensuse/
Only in the latter case might you be able to convert the HTML sent to you into something from which you can create a mirror of the directory.
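As a rough illustration (it only works when such an auto-generated index page exists, and the gwdg.de address above is just the example), the links in one listing page could be pulled out with something like:

    # fetch one directory-listing page and extract the href targets
    wget -qO- https://ftp.gwdg.de/pub/opensuse/ \
      | grep -oE 'href="[^"]*"' \
      | sed -e 's/^href="//' -e 's/"$//' > listing.txt

That only covers a single directory level; descending into subdirectories would still need a recursive tool such as wget -r or pavuk.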
What looks like a directory tree when you access an HTTP server might in fact be different. Parts of a website may be stored under a different so-called server root on the server. That may or may not matter for your goal.
Web content may be generated by e.g. PHP scripts or other server-side programming. Thus the results may differ each time the same URL is requested.
You talk about properties of the directories/files. If you mean the owner:group and the permission bits, those can certainly not be seen from the server.
In fact all the mirroring tools will of course copy the contents as well. Apparently that is not what you want. Deleting the content afterwards looks like a solution here, but it of course involves all the unneeded data transfer and is thus a bit of a strange thing to do.
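If you do manage to get a plain list of URLs instead (say a file filenames.txt with one full URL per line; that file name is just my assumption), a small loop could recreate the tree with empty placeholder files and no data transfer at all:

    # read one URL per line and recreate its path locally as an empty file
    while IFS= read -r url; do
        path="${url#*://*/}"            # strip the scheme and the host name
        mkdir -p "$(dirname "$path")"   # recreate the directory part
        touch "$path"                   # empty file with the original name
    done < filenames.txt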
The site I want to download from is a non-profit spiritual website, and they distribute the files freely. On the website itself they clearly state that it does not contain any copyrighted material, and they ask anyone who does find copyrighted content to report it. For your reference I am providing the website address below. As the content they provide is not copyrighted, anyone can download it.
It's been a while, but IIRC even a long time ago both wget and httrack supported various filters, so you can specify downloading only mp3 files, for instance.
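From memory, so treat it as a sketch (example.org stands in for the real site): something like this mirrors only the mp3 files, and -nc gives the "skip if the destination file exists" behaviour mentioned earlier:

    # recursive download, mp3 files only, keep existing files, store under mirror/
    wget -r -l inf --no-parent -nc -A '*.mp3' -P mirror/ http://example.org/audio/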
Of course, if you copied the entire website, it would be translated into pure HTML no matter what the original website technology might be (e.g. PHP, ASP, Java, etc.). You should then simply point your web browser to index.html and browse your own copy of the website.
After using httrack for a while, I came to strongly prefer wget for its power, speed and greater versatility. I built a small library of wget examples for downloading sites in specific ways.
But for the occasional user I highly recommend httrack as something that just works and presents its options graphically, so it's quickly configurable for the less familiar.
BTW - The technical term for this type of software is “scraping”
My question is straightforward, i.e., is it possible to obtain just the whole website structure and store it in a text file? Another relevant fact is that the site I am going to download doesn't have links which point to other websites.
If you aren’t familiar with web scraping, use httrack. When you start it up, a wizard should launch walking you through configuration with the option to simply use a saved configuration from previous use.
You’ll be able to download the entire website, but it won’t be stored as a single file. Each page downloaded from the website will be analyzed and converted to an HTML-only page, so the end result will be numerous files, but every internal website link will be converted to a link which will work in your HTML-only website copy.
During configuration, you will be able to set link depths, so even external links that touch other websites can be downloaded into your copy as well. If there are no external links, that's OK.
So again, when your website copy is done, just point your web browser to the home page (in fact, httrack may automatically launch your default web browser on the home page when it completes)… although I suppose you can start from any other HTML page as well, since internal website links will likely lead back to the home page unless the website is really weird.
Yep, your question is straightforward. So straightforward that you've posted it on at least six other forums verbatim, and keep asking people to write you a script, as you did here. What you DO NOT seem to do is show us your work. I doubt anyone would turn you away if you needed help, but it seems like you're just panhandling and trying to get someone to write scripts for you. Others, please examine this same thread on LinuxQuestions, LinuxForums, the HTTrack forums, techist, and funnily enough eightforums (for Windows 8/10 users). All the same thing, with no effort shown.
Rupeshforu3, can you prove us wrong here? Show us what code you have written. Because there are LOADS of bash tutorials. If you have a list of the files you've already downloaded, and a list of all the files, you just have to compare the two using diff to get one list (the ones you don't have). Writing a simple script to loop through an input file and run wget should take 5 minutes or so. Certainly a lot less time than it took you to register and post on all those forums looking for someone to do it for you. You've even claimed on some of them to be a newbie with no scripting experience, and have other threads going back five years saying the same thing.
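For example (untested, and using comm instead of diff because it prints exactly the lines that are only in the first list; all.txt and have.txt are made-up names, each holding one URL per line in the same form):

    sort all.txt -o all.txt
    sort have.txt -o have.txt
    comm -23 all.txt have.txt > missing.txt   # URLs wanted but not yet downloaded

    # fetch the missing ones; -x recreates the site's directory structure,
    # -nc skips anything that already exists locally
    wget -nc -x -i missing.txt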