Re: [Erlug] awk & regexp

To: <erlug@xxxxxxxxxxxxxx>
Subject: Re: [Erlug] awk & regexp
From: "Alessandro Forghieri" <alf@xxxxxxxx>
Date: Thu, 14 Feb 2002 10:54:17 +0100

Greetings.

> I need to strip some HTML tags, but * is too greedy.

This is exactly one of the reasons why it is advisable to use a "real"
parser rather than trying to synthesize one out of regexps. Other problems
are angle brackets inside quoted attributes, tags inside comments, closing
angle brackets on a different line, and so on.
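
As a rough illustration of the "real parser" route, here is a minimal
sketch using html.parser from the Python standard library. Python and the
names TextExtractor and strip_html are assumptions chosen for the example;
they do not come from the thread, which recommends HTML::Parser for Perl.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; ignore tags, comments and script/style bodies."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return the text content of markup (hypothetical helper name)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(strip_html('<IMG SRC = "foo.gif"\n     ALT = "A > B">caption'))
    print(strip_html("a &lt; b <B>bold</B>"))
```

The parser tracks quoting, comments and line breaks itself, so the caller
only decides which text to keep.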

I quote from perldoc -q HTML (note that the regexes can be converted from
Perl to sed, though that may not be trivial). Especially interesting are
the hard cases (legal, and even common) listed at the end.

----

  How do I remove HTML from a string?

            The most correct way (albeit not the fastest) is to use
            HTML::Parser from CPAN. Another mostly correct way is to use
            HTML::FormatText which not only removes HTML but also attempts
            to do a little simple formatting of the resulting plain text.

            Many folks attempt a simple-minded regular expression approach,
            like "s/<.*?>//g", but that fails in many cases because the tags
            may continue over line breaks, they may contain quoted
            angle-brackets, or HTML comment may be present. Plus, folks
            forget to convert entities--like "&lt;" for example.

            Here's one "simple-minded" approach, that works for most files:

                #!/usr/bin/perl -p0777
                s/<(?:[^>'"]*|(['"]).*?\1)*>//gs


            If you want a more complete solution, see the 3-stage striphtml
            program in
            http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

            Here are some tricky cases that you should think about when
            picking a solution:

                <IMG SRC = "foo.gif" ALT = "A > B">

                <IMG SRC = "foo.gif"
                     ALT = "A > B">

                <!-- <A comment> -->

                <script>if (a<b && a>c)</script>

                <# Just data #>

                <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

            If HTML comments include other tags, those solutions would also
            break on text like this:

                <!-- This section commented out.
                    <B>You can't see me!</B>
                -->
----
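
To see how the two patterns from the FAQ behave on some of the tricky
cases, here is a small sketch that ports them to Python's re module. The
port is an assumption made for illustration: re.S stands in for Perl's /s
flag, it is also applied to the naive pattern so that tags can span lines,
and the names strip_naive and strip_quote_aware are hypothetical.

```python
import re

# Naive non-greedy pattern mentioned in the FAQ text.
NAIVE = re.compile(r"<.*?>", re.S)

# Port of the FAQ's quote-aware pattern: inside a tag, accept either runs
# of characters that are not >, ' or ", or a complete quoted string.
QUOTE_AWARE = re.compile(r"""<(?:[^>'"]*|(['"]).*?\1)*>""", re.S)


def strip_naive(text):
    return NAIVE.sub("", text)


def strip_quote_aware(text):
    return QUOTE_AWARE.sub("", text)


CASES = [
    '<IMG SRC = "foo.gif" ALT = "A > B">',
    '<IMG SRC = "foo.gif"\n     ALT = "A > B">',
    "<!-- <A comment> -->",
    "<script>if (a<b && a>c)</script>",
]

if __name__ == "__main__":
    for case in CASES:
        print(repr(case))
        print("  naive:      ", repr(strip_naive(case)))
        print("  quote-aware:", repr(strip_quote_aware(case)))
```

The quote-aware pattern copes with the quoted ">" in the IMG tags, but it
still leaves " -->" behind for the comment and mangles the script body,
which is the point the FAQ makes. Nested quantifiers like these can also
backtrack heavily on input that contains an unclosed "<".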

Regards,
alf



