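Regular expression for checking the attributes in an HTML tag

Time: 2012-03-26 10:46:34

Tags: regex

In HTML source code, I need to extract any plain text inside FONT tags that carry exactly these 3 attributes (no more, no fewer), in any order: size=5, color="red", face="verdana".

So the regular expression must extract the "randomtext" from all of the following lines except the last four.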

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "any order" problem by using 3 lookaheads:

<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>

...or, with more HTML flexibility:

<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>

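The problem is that it also matches the last three lines. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without enumerating every valid permutation, and without literal negative expressions that only work for my examples.)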
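To make the problem concrete, here is a minimal Python sketch that runs the first (simple) pattern from the question against the ten sample lines. The names PATTERN, SAMPLES and matching_indexes are made up for this illustration; only the regex and the sample lines come from the question.

```python
import re

# The simple lookahead pattern from the question, copied verbatim.
PATTERN = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)

# The ten sample lines from the question (indexes 0-9).
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]


def matching_indexes(pattern, samples):
    """Return the 0-based indexes of the samples that the pattern matches."""
    return [i for i, s in enumerate(samples) if pattern.search(s)]


if __name__ == "__main__":
    # Indexes 7, 8 and 9 are the unwanted matches: the lookaheads only
    # require the three attributes to be present, not to be the only ones.
    print(matching_indexes(PATTERN, SAMPLES))
```

The lookaheads reject the size=5 size=5 size=5 line, but the trailing [^>]* happily consumes any extra attribute, which is why the last three lines still match.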

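2 Answers:

Answer 0: (score: 1)

Here is one way, although I agree with those who say that a regexp is not the right tool for this job:

Content of script.pl (the regular expression is inside, explained in comments):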

use warnings;
use strict;

while ( <DATA> ) {
    printf qq[Text matched: %s\t (original string: %s)\n], $1, $& if 
    m/ 
        # At start of string, '<' character plus optional whitespace.
        \A < \s*
        # Literal 'font' word.
        font
        # Mandatory space.
        \s+
        # Positive look-ahead for string 'size=5'
        (?= .* size \s* = \s* 5 (?:\s+|>) )   
        # Positive look-ahead for string 'face="verdana"'
        (?= .* face \s* = \s* "verdana" (?:\s+|>) )
        # Positive look-ahead for string 'color="red"'
        (?= .* color \s* = \s* "red" (?:\s+|>) )
        # If the three look-aheads succeed, consume exactly three attributes from the allowed set.
        (?:size\s*=\s*5\s*|color\s*=\s*"red"\s*|face\s*=\s*"verdana"\s*){3}
        # Literal '>' character.
        >
        # Text between tags (everything up to the next '<').
        ([^<]+)
        # Close tag and match end of string.
        <\/font> \Z
    /x;
}

__DATA__
<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

Run it like this:

perl script.pl

With the following result:

Text matched: randomtext         (original string: <font size=5 color="red" face="verdana">randomtext</font>)
Text matched: randomtext         (original string: <font size=5 face="verdana" color="red">randomtext</font>)
Text matched: randomtext         (original string: <font color="red" size=5 face="verdana">randomtext</font>)
Text matched: randomtext         (original string: <font color="red" face="verdana" size=5>randomtext</font>)
Text matched: randomtext         (original string: <font face="verdana" size=5 color="red">randomtext</font>)
Text matched: randomtext         (original string: <font face="verdana" color="red" size=5>randomtext</font>)
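For readers who do not use Perl, the same idea (three look-aheads to require each attribute, plus a group repeated exactly three times so nothing else fits inside the tag) can be sketched in Python. This is an illustrative port, not part of the original answer; the names EXACT_THREE, SAMPLES and extract are assumptions made for this example.

```python
import re

# Verbose-mode port of the Perl pattern above: whitespace in the pattern is
# ignored and '#' starts a comment, as with Perl's /x flag.
EXACT_THREE = re.compile(r'''
    \A < \s* font \s+
    # Each required attribute must appear somewhere ahead.
    (?= .* size  \s* = \s* 5         (?: \s+ | > ) )
    (?= .* face  \s* = \s* "verdana" (?: \s+ | > ) )
    (?= .* color \s* = \s* "red"     (?: \s+ | > ) )
    # Consume exactly three attributes, each taken from the allowed set.
    (?: size  \s* = \s* 5         \s*
      | color \s* = \s* "red"     \s*
      | face  \s* = \s* "verdana" \s* ){3}
    >
    ([^<]+)          # text between the tags
    </font> \Z
''', re.VERBOSE)

# The ten sample lines from the question (indexes 0-9).
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]


def extract(line):
    """Return the text inside the tag, or None if the line is rejected."""
    m = EXACT_THREE.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    for i, line in enumerate(SAMPLES):
        print(i, extract(line))
```

The look-aheads rule out repeated attributes such as size=5 size=5 size=5, while the {3} group rules out any fourth attribute, because the closing > must follow immediately after the third allowed one.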

Answer 1: (score: 0)

You do realize that this is difficult? If you have any other option, use it!

For a regex, try this:

<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")(?![^>]*(?<!color|size|face)=)(?:\s+[^>\s=]+=[^>\s=]+\s*)+>([^<]+)</font>

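I added/changed two things:

  1. (?![^>]*(?<!color|size|face)=) is a negative lookahead with a nested negative lookbehind assertion; it disallows any equals sign inside the tag that is not directly preceded by color, size, or face.

  2. I changed the [^>]* that matched the attributes to (?:\s+[^>\s=]+=[^>\s=]+\s*)+, so that it only matches name=value pairs whose name and value are runs of characters other than whitespace, > and =.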
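The two changes can be tried out with a short Python sketch. One adaptation is needed, and it is an assumption of this example rather than part of the answer: Python's re module only accepts fixed-width look-behinds, so (?<!color|size|face) is written as three consecutive negative look-behinds, which should be logically equivalent. The names STRICT, SAMPLES and extract are likewise made up for illustration.

```python
import re

STRICT = re.compile(
    r'<font'
    r'(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    # Change 1: no '=' in the tag unless preceded by color, size or face.
    # Split into three look-behinds because Python needs fixed widths.
    r'(?![^>]*(?<!color)(?<!size)(?<!face)=)'
    # Change 2: the tag body must consist of name=value pairs only.
    r'(?:\s+[^>\s=]+=[^>\s=]+\s*)+'
    r'>([^<]+)</font>'
)

# The ten sample lines from the question (indexes 0-9).
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]


def extract(line):
    """Return the text inside the tag, or None if the line is rejected."""
    m = STRICT.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    for i, line in enumerate(SAMPLES):
        print(i, extract(line))
```

In this sketch the foobar lines are rejected by the first change (their = follows a name other than color, size or face), and the garbagetext line is rejected by the second change (a bare word is not a name=value pair).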