Hello, when you download an html file from the web you do not need the comments inside the file, so I decided to use gawk to strip them:
gawk '{ gsub(/<!--.*?-->/, "", $0); print $0}'
I had intended to use the lazy star to avoid matching more than one comment block, because otherwise I might eliminate stuff that I want, as illustrated here:
<!-- comment one --> Important text <!-- comment2 -->
but it’s not working.
Problem number two is that I wanted to edit the file in place, but I’m not sure how to do that in awk. I don’t strictly have to, as I might try something like this:
program > tmp.html; mv tmp.html realname.html
but I’d have to change realname for each file that I feed to the program.
There probably is another program out there that does this, but I’m interested in the learning experience.
Please explain “but it is not working”. That can mean lots of things. Describe an exact case: what you did, what you expected to happen and what happened instead.
As for editing “in place”: that is not something that really exists. You cannot write to a file you are reading at the same time (well, you can, but you may get strange results). There may be programs that have an option to do this, but in fact they only write to an intermediate file and move that over the original after processing, the same as you do.
And of course you put that in a script when you want to do this often with different files. Something like:
#!/bin/bash
program <"$1" >intfile && mv intfile "$1"
which you then call with the file name as parameter. And where you can replace program with e.g. your gawk command.
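To address the "change realname for each file" complaint, the same temp-file-then-move idea can be wrapped in a small shell function. This is only a sketch: the name in_place is made up for illustration, and it assumes mktemp is available (it is not part of POSIX, though most systems ship it).

```shell
#!/bin/sh
# in_place is a hypothetical helper name, not a standard tool.
# Usage: in_place FILE COMMAND [ARGS...]
# Runs COMMAND with FILE on stdin, writes stdout to a temporary file,
# and moves the temporary file over FILE only if COMMAND succeeded.
in_place() {
    file=$1
    shift
    # Create the temporary file next to the original so mv stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if "$@" <"$file" >"$tmp"; then
        mv "$tmp" "$file"
    else
        # Leave the original untouched when the command fails.
        rm -f "$tmp"
        return 1
    fi
}

# Example call for every HTML file in the current directory (commented out
# so that loading this file does not modify anything):
# for f in *.html; do in_place "$f" tr a-z A-Z; done
```

Any filter that reads stdin and writes stdout can stand in for tr here, including a gawk command. One caveat: the file that replaces the original is the one mktemp created, so its permissions may differ from the original's.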
I would not recommend using awk, sed and grep for html/xml files; there are tools written exactly for that purpose. Of course, like any other program, they have a learning curve. But that is just me; feel free to use whatever tools you feel comfortable with.
lynx -dump foo.html
lynx -dump -listonly foo.html
lynx -dump -listonly -nonumbers foo.html
lynx -dump -listonly -image_links file.html
Depending on the use case one of them should suffice.
Yes, I know that I could use lynx, but I wanted to work with AWK. You will not learn anything by using someone else’s program (well, maybe a little). The point of this exercise of mine was to learn how to create a regex that is non-greedy, hence the *? (lazy star). Using sed with ^> will match up to a single >, which might be located anywhere between the beginning of the comment and the end of it, since using > is not illegal in a comment as far as I know, so you will wind up not removing the whole comment.
Perhaps I should take this to a regex or awk mailing list somewhere…
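For the learning exercise itself: awk regular expressions follow POSIX ERE, which has no lazy quantifiers, so *? does not behave the way it does in Perl-style regexes. One way around that is to skip the regex and search for the literal delimiters with index(). The sketch below does that. The function name strip_comments is made up for illustration, and it deliberately ignores HTML corner cases such as comment-like text inside script blocks.

```shell
#!/bin/sh
# strip_comments is a hypothetical helper name.
# Reads the named files (or stdin) and removes <!-- ... --> spans,
# always stopping at the first --> after each <!--.
strip_comments() {
    awk '
    {
        line = $0
        out = ""
        while (line != "") {
            if (in_comment) {
                end = index(line, "-->")
                if (end == 0) { line = ""; break }   # comment continues on the next line
                line = substr(line, end + 3)         # drop everything up to and including -->
                in_comment = 0
            } else {
                start = index(line, "<!--")
                if (start == 0) { out = out line; line = ""; break }
                out = out substr(line, 1, start - 1) # keep the text before the comment
                line = substr(line, start + 4)
                in_comment = 1
            }
        }
        print out
    }' "$@"
}
```

Fed the example line from the first post, this keeps " Important text " and drops both comments. A > inside a comment does no harm, because only the full --> sequence ends a comment. Lines that lie entirely inside a multi-line comment come out as empty lines.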
Yes, you should, and good luck, because sed, awk and grep are not written to parse html/xml.
Most probably you will get the same answer, or they will point you to some existing tools such as