Re: [Erlug] awk & regexp

"Alessandro Forghieri" <alf@xxxxxxxx> · Thu, 14 Feb 2002 10:54:17 +0100

Greetings.

> I need to strip some HTML tags, but * is too greedy.

This is exactly one of the reasons why it is advisable to use a "real"
parser instead of trying to synthesize one out of regexps. Other problems
are angle brackets inside quoted attributes, tags inside comments, closing
angle brackets on a different line, and so on.
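
As a rough illustration of the "real parser" route, here is a minimal
sketch in Python (not Perl, purely so it can be run with nothing but the
standard library). It relies on html.parser.HTMLParser; the names
TextExtractor and strip_html are made up for this example, and dropping
the bodies of script and style elements is my own assumption about what
"stripping tags" should mean. It is not a complete solution: oddities
such as SGML marked sections are not considered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content; drop tags, comments and script/style bodies."""

    # Assumption for this sketch: these element bodies are not wanted as text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # "&lt;" before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment(), which
        # this class leaves as the default no-op.
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return the text of `markup` with the HTML removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(strip_html('<IMG SRC = "foo.gif"\n     ALT = "A > B">kept'))
    print(strip_html("<p>1 &lt; 2</p>"))
```

The point of the sketch is that quoting, comments and line breaks are
handled by the parser's own state machine, so nothing in the calling code
has to guess where a tag ends.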

I quote from perldoc -q HTML (note that the regexes can be converted from
perl to sed, although doing so may not be trivial). Especially interesting
are the difficult (but legal, and even common) cases listed at the end.

----

  How do I remove HTML from a string?

            The most correct way (albeit not the fastest) is to use
            HTML::Parser from CPAN. Another mostly correct way is to use
            HTML::FormatText which not only removes HTML but also attempts
            to do a little simple formatting of the resulting plain text.

            Many folks attempt a simple-minded regular expression approach,
            like "s/<.*?>//g", but that fails in many cases because the tags
            may continue over line breaks, they may contain quoted
            angle-brackets, or HTML comments may be present. Plus, folks
            forget to convert entities--like "&lt;" for example.

            Here's one "simple-minded" approach, that works for most files:

                #!/usr/bin/perl -p0777
                s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

            If you want a more complete solution, see the 3-stage striphtml
            program in
            http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

            Here are some tricky cases that you should think about when
            picking a solution:

                <IMG SRC = "foo.gif" ALT = "A > B">

                <IMG SRC = "foo.gif"
                     ALT = "A > B">

                <!-- <A comment> -->

                <script>if (a<b && a>c)</script>

                <# Just data #>

                <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

            If HTML comments include other tags, those solutions would also
            break on text like this:

                <!-- This section commented out.
                    <B>You can't see me!</B>
                -->
----
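
To make the tricky cases above concrete, here is a small sketch, again in
Python rather than Perl so that it can be checked easily. NAIVE mirrors
the "s/<.*?>//g" idea, and QUOTE_AWARE is an adaptation (not a literal
port) of the quote-aware pattern from the FAQ: I dropped the inner "*" so
that the pattern has no nested quantifiers, which is my own choice and not
something the FAQ prescribes. The function names are hypothetical.

```python
import re

# Naive approach: remove the shortest run from '<' to the next '>'.
NAIVE = re.compile(r"<.*?>", re.DOTALL)

# Quote-aware approach: inside a tag, accept either one character that is
# not '>' or a quote, or a complete quoted string (which may contain '>').
QUOTE_AWARE = re.compile(r"""<(?:[^>'"]|(['"]).*?\1)*>""", re.DOTALL)


def strip_naive(markup):
    """Strip tags with the naive pattern (hypothetical helper)."""
    return NAIVE.sub("", markup)


def strip_quote_aware(markup):
    """Strip tags with the quote-aware pattern (hypothetical helper)."""
    return QUOTE_AWARE.sub("", markup)


if __name__ == "__main__":
    img = '<IMG SRC = "foo.gif" ALT = "A > B">'
    comment = "<!-- This section commented out.\n    <B>You can't see me!</B>\n-->"

    # The naive pattern stops at the '>' inside the ALT attribute.
    print(repr(strip_naive(img)))
    # The quote-aware pattern removes the whole tag.
    print(repr(strip_quote_aware(img)))
    # Neither pattern understands comments, so text from inside the
    # comment leaks into the output.
    print(repr(strip_quote_aware(comment)))
```

On the IMG example the naive pattern leaves ' B">' behind, while the
quote-aware one removes the tag completely. On the commented-out section
even the quote-aware pattern leaves "You can't see me!" and the closing
"-->" in the output, which is the failure the FAQ warns about.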

Regards,
alf

