Greetings.
> I need to strip some HTML tags, but * is too greedy.
This is exactly one of the reasons why it is advisable to use a "real"
parser rather than trying to synthesize one out of regexps. Other
problems are angle brackets inside quoted attributes, tags inside
comments, closing angle brackets on a different line, etc.
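To make the greediness problem concrete, here is a small sketch in Perl.
The sample strings are made up for illustration, they are not from the
original question. It compares a greedy match, a non-greedy one, and a
negated character class. The last form is the one that should carry over
to sed, whose regex dialects have no lazy ".*?".

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only to show the behaviour.
my $html = '<b>bold</b> and <i>italic</i>';

# Greedy: .* runs to the LAST '>' in the string, so everything goes.
(my $greedy = $html) =~ s/<.*>//g;

# Non-greedy: .*? stops at the first '>' after each '<'.
(my $lazy = $html) =~ s/<.*?>//g;

# Negated class: same result here, without needing a lazy quantifier.
(my $negated = $html) =~ s/<[^>]*>//g;

# Still fooled by a '>' inside a quoted attribute: the match ends early
# and the rest of the tag leaks into the output.
(my $broken = '<IMG SRC="foo.gif" ALT="A > B">text') =~ s/<[^>]*>//g;

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
print "broken:  [$broken]\n";
```

So the negated class cures the greediness, but not the other problems
listed above.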
I quote from perldoc -q HTML (note that the regexes can be converted from
Perl to sed, but doing so may not be trivial). Especially interesting are
the hard cases (legal, and even common) listed at the end.
----
How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use
HTML::Parser from CPAN. Another mostly correct way is to use
HTML::FormatText which not only removes HTML but also attempts
to do a little simple formatting of the resulting plain text.
Many folks attempt a simple-minded regular expression approach,
like "s/<.*?>//g", but that fails in many cases because the tags
may continue over line breaks, they may contain quoted
angle-brackets, or HTML comment may be present. Plus, folks
forget to convert entities--like "&lt;" for example.
Here's one "simple-minded" approach, that works for most files:
    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the 3-stage striphtml
program in
http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .
Here are some tricky cases that you should think about when
picking a solution:
    <IMG SRC = "foo.gif" ALT = "A > B">
    <IMG SRC = "foo.gif"
         ALT = "A > B">
    <!-- <A comment> -->
    <script>if (a<b && a>c)</script>
    <# Just data #>
    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
If HTML comments include other tags, those solutions would also
break on text like this:
    <!-- This section commented out.
        <B>You can't see me!</B>
    -->
----
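As a quick check of the quoted regex against some of those cases, here
is a sketch. The helper name strip_tags and the trailing word "after" are
my own additions for illustration; the substitution itself is the one
from the FAQ text above.

```perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name; the regex is the FAQ's
# "simple-minded" substitution, applied to a whole string at once.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $text;
}

# Quoted '>' inside an attribute: the whole tag is removed.
my $quoted = strip_tags(qq{<IMG SRC = "foo.gif" ALT = "A > B">after});

# Same tag split over two lines: also removed.
my $multiline = strip_tags(qq{<IMG SRC = "foo.gif"\nALT = "A > B">after});

# Script content is not markup, but part of it is eaten as a "tag".
my $script = strip_tags('<script>if (a<b && a>c)</script>');

# Comment containing a tag: the commented-out text leaks through.
my $comment = strip_tags(
    qq{<!-- This section commented out.\n<B>You can't see me!</B>\n-->});

print "quoted:    [$quoted]\n";
print "multiline: [$multiline]\n";
print "script:    [$script]\n";
print "comment:   [$comment]\n";
```

The first two come out clean, while the script and comment cases leave
debris behind, which is the point the FAQ makes in favour of a real
parser.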
Regards,
alf