Use gawk and regex to strip comments from HTML files

Hello, when you download an html file from the web you do not need the comments inside the file, so I decided to use gawk to strip them:

gawk '{ gsub(/<!--.*?-->/, "", $0); print $0}'

I had intended to use the lazy star to avoid matching more than one comment block, because otherwise I might eliminate text that I want to keep, as illustrated here:

<!-- comment one --> Important text <!-- comment2 -->

but, it’s not working.

Problem number two is that I wanted to edit the file in place, but I’m not sure how to do that in awk. I don’t strictly have to, as I might try something like this:

 program > tmp.html; mv tmp.html realname.html

but I’d have to change realname.html for each file that I feed to the program.

There probably is another program out there that does this, but I’m interested in the learning experience.

Please explain “but it is not working”. That can mean lots of things. Describe an exact case: what you did, what you expected to happen and what happened instead.

And for the editing “in place”: that is not something that really exists. You cannot write to a file you are reading at the same time (well, you can, but you may get strange results). There may be programs that have an option to do this, but in fact they only write to an intermediate file and move that over the original after processing, the same as you do.

And of course you put that in a script when you want to do this often with different files. Something like:

program <"${1}" >intfile && mv intfile "${1}"

which you then call with the file name as parameter. And where you can replace program with e.g. your gawk command.
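
As a rough sketch of that script idea, extended to take several files at once: the function name run_in_place and the use of mktemp are my own assumptions, not something from the posts above. It writes each result to a temporary file next to the original and only moves it over the original when the filter exits successfully.

```shell
# run_in_place FILTER FILE...
# FILTER is a command string that reads stdin and writes stdout.
run_in_place() {
    filter=$1
    shift
    for f in "$@"; do
        # Temporary file in the same directory as the original.
        tmp=$(mktemp "$f.XXXXXX") || return 1
        if sh -c "$filter" < "$f" > "$tmp"; then
            mv "$tmp" "$f"        # success: replace the original
        else
            rm -f "$tmp"          # failure: keep the original untouched
            return 1
        fi
    done
}
```

Called as, for example, run_in_place 'cat -s' one.html two.html, where the first argument can be any command that reads stdin and writes stdout (the gawk command, once it works). Note that mv replaces the original file, so the result may end up with the temporary file's permissions. Newer gawk releases (4.1 and later, as far as I know) also ship an inplace extension, used as gawk -i inplace '...' file, which does the same temp-file work behind the scenes.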

“Does not work” == Input and output

<!-- comment 1 --> important text <!-- comment 2 -->

% gawk '{ gsub(/<!--.*?-->/, "", $0); print $0}' a.html


Expected output

important text


As usual there are many solutions in Linux, and as you described what you want to achieve, I have taken the liberty to use another tool: sed

henk@boven:~> echo '<!-- comment 1 --> important text <!-- comment 2 -->' | sed 's/<!--[^>]*>//g'
 important text 

seems to do the trick.

I would not recommend using awk, sed and grep for html/xml files; there are tools written exactly for that purpose. Of course, like any other program, such a tool has a learning curve. But that is just me; feel free to use whatever tools you feel comfortable with.

lynx -dump  foo.html
lynx -dump -listonly  foo.html
lynx -dump -listonly -nonumbers foo.html
lynx -dump -listonly -image_links file.html

Depending on the use case one of them should suffice.

For some short info see

lynx --help

if you don’t like reading man pages :)

That is nice, I never would have thought about using lynx for this. I repeat: very nice indeed.

Yes, I know that I could use lynx, but I wanted to work with AWK. You will not learn anything by using someone else’s program (well, maybe a little). The point of this exercise of mine was to learn how to create a regex that is non-greedy, hence the *? (lazy star). And using sed with [^>]*> will stop at the first >, which might be located anywhere between the beginning of the comment and the end of it, since using > is not illegal in a comment as far as I know, so you will wind up not removing the whole comment.
Perhaps I should take this to a regex or awk mailing list somewhere…
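
Since awk regexes are greedy, one way to get the lazy behaviour without a regex at all is to search for the literal delimiters with index(): find "<!--", then the first "-->" after it. This is a minimal sketch, assuming comments never span lines; the function name strip_comments is made up for illustration. Because it looks for the literal "-->" rather than the first ">", a ">" inside a comment does not end the match early. It uses only POSIX awk functions, so gawk should run it unchanged.

```shell
# strip_comments [FILE...]: remove <!-- ... --> comments line by line.
strip_comments() {
    awk '
    {
        out = ""
        rest = $0
        while ((start = index(rest, "<!--")) > 0) {
            out = out substr(rest, 1, start - 1)   # keep text before the comment
            rest = substr(rest, start + 4)         # skip past the opener
            stop = index(rest, "-->")              # first closer = shortest match
            if (stop == 0) {                       # no closer on this line:
                out = out "<!--" rest              # leave the tail untouched
                rest = ""
                break
            }
            rest = substr(rest, stop + 3)          # skip past the closer
        }
        print out rest
    }' "$@"
}
```

Used as strip_comments a.html, or at the end of a pipe. A comment that opens on one line and closes on another is left as it is, so this is only a starting point for real pages.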

Yes, you should, and good luck, because sed, awk and grep are not written to parse html/xml.
Most probably you will get the same answer, or they will point you to some existing tools such as


which is just


in openSUSE, again good luck.
