Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.
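To make those failure modes concrete, here is a small sketch that runs the naive substitution over three made-up inputs. The helper name naive_strip and the sample strings are hypothetical, chosen only to trigger each problem.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the naive substitution from the text.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;    # no /s modifier, no awareness of quotes or comments
    return $html;
}

# Made-up inputs, one per failure mode mentioned above.
my %samples = (
    multiline => "<a\nhref='x'>link</a>",               # tag spans a line break
    quoted    => q{<img alt="a > b" src="x.png">text},  # '>' inside a quoted attribute
    comment   => "<!-- <b>not shown</b> -->text",       # markup inside a comment
);

for my $name (sort keys %samples) {
    printf "%-9s => [%s]\n", $name, naive_strip($samples{$name});
}
```

In each case some markup survives: the opening tag that spans two lines is left untouched, the quoted attribute is cut at the first >, and the comment's contents and closing --> leak into the output.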
Here's one "simple-minded" approach that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
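The same pattern can also be applied to a string already in memory. The sketch below wraps it in a hypothetical strip_html function and adds a deliberately tiny entity table (only lt, gt, amp and quot) as an assumption for illustration; real documents use many more entities.

```perl
use strict;
use warnings;

# Assumed, incomplete entity table for illustration only.
my %entity = (lt => '<', gt => '>', amp => '&', quot => '"');

# Hypothetical helper: remove tags with the pattern shown above,
# then decode the few entities listed in %entity.
sub strip_html {
    my ($html) = @_;
    # Tags may span lines and may contain quoted '>' characters.
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    # Decode in a single pass so that "&amp;lt;" becomes "&lt;", not "<".
    $html =~ s/&(lt|gt|amp|quot);/$entity{$1}/g;
    return $html;
}

print strip_html(qq{<p class="x > y">1 &lt; 2 &amp;\n<b\nid='b'>bold</b></p>}), "\n";
```

For full entity coverage, the decode_entities function in the HTML::Entities module from CPAN is probably a better choice than a hand-rolled table. Note that this pattern still does not treat HTML comments specially.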
If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .