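In HTML source code, I need to extract any simple text inside FONT tags that carry exactly these 3 attributes (no more, no fewer), in any order: size=5, color="red", face="verdana".
So the regular expression must extract "randomtext" from all of the following lines except the last four.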
<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>
I solved the "any order" problem by using 3 lookaheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>
...or, with more HTML flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>
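The problem is that it also matches the last three (the line with size=5 repeated three times is already rejected, because it has no color or face). How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without enumerating every possible positive combination, and without literal negative expressions that would only work for my examples.)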
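To make the problem concrete, here is a minimal Python sketch that runs the first pattern from the question against the ten sample lines. Python and the helper name matching_indices are my own choices for illustration; the question does not name a language. Indices are 0-based, so 0 to 5 are the six permutations and 6 to 9 are the four lines that should be rejected.

```python
import re
from itertools import permutations

# The question's first pattern, copied verbatim.
PATTERN = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)

ATTRS = ['size=5', 'color="red"', 'face="verdana"']

# Six valid permutations, in the same order as the question lists them.
GOOD = ['<font %s>randomtext</font>' % ' '.join(p) for p in permutations(ATTRS)]

# The four lines that must NOT match.
BAD = [
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

SAMPLES = GOOD + BAD


def matching_indices(pattern, samples):
    """Return the 0-based indices of the samples the pattern matches."""
    return [i for i, s in enumerate(samples) if pattern.search(s)]


if __name__ == "__main__":
    print(matching_indices(PATTERN, SAMPLES))
```

The lookaheads only assert that each required attribute is present somewhere in the tag, and the trailing [^>]* then swallows anything else, so indices 7, 8 and 9 still match. Only index 6 is rejected.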
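Answer 0 (score: 1)
One way, and I also agree with those who say that regex is not the right tool for this job:
Content of script.pl (the regex is inside, explained in comments):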
use warnings;
use strict;

while ( <DATA> ) {
    printf qq[Text matched: %s\t (original string: %s)\n], $1, $& if
        m/
            # At beginning of string, '<' character plus optional whitespace.
            \A < \s*
            # Literal 'font' word.
            font
            # Mandatory whitespace.
            \s+
            # Positive look-ahead for 'size=5' inside the tag.
            (?= [^>]* size \s* = \s* 5 (?:\s+|>) )
            # Positive look-ahead for 'face="verdana"' inside the tag.
            (?= [^>]* face \s* = \s* "verdana" (?:\s+|>) )
            # Positive look-ahead for 'color="red"' inside the tag.
            (?= [^>]* color \s* = \s* "red" (?:\s+|>) )
            # The look-aheads succeeded: consume exactly three of the allowed attributes.
            (?:size\s*=\s*5\s*|color\s*=\s*"red"\s*|face\s*=\s*"verdana"\s*){3}
            # Literal '>' character.
            >
            # Text between the tags (anything up to the next '<').
            ([^<]+)
            # Closing tag and end of string.
            <\/font> \Z
        /x;
}
__DATA__
<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>
Run it like this:
perl script.pl
With the following result:
Text matched: randomtext (original string: <font size=5 color="red" face="verdana">randomtext</font>)
Text matched: randomtext (original string: <font size=5 face="verdana" color="red">randomtext</font>)
Text matched: randomtext (original string: <font color="red" size=5 face="verdana">randomtext</font>)
Text matched: randomtext (original string: <font color="red" face="verdana" size=5>randomtext</font>)
Text matched: randomtext (original string: <font face="verdana" size=5 color="red">randomtext</font>)
Text matched: randomtext (original string: <font face="verdana" color="red" size=5>randomtext</font>)
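For readers who do not use Perl, here is a rough Python port of the same idea: three lookaheads guarantee that each attribute is present, and {3} over an alternation of the allowed attributes guarantees that nothing else is. This is a sketch rather than a line-by-line translation. The names FONT_RE and extract are my own, the lookaheads are confined to the tag with [^>]*, and re.fullmatch stands in for the \A ... \Z anchors.

```python
import re
from itertools import permutations

FONT_RE = re.compile(r'''
    < \s* font \s+
    # Each required attribute must appear somewhere before the closing '>'.
    (?= [^>]* \b size  \s*=\s* 5         (?:\s|>) )
    (?= [^>]* \b face  \s*=\s* "verdana" (?:\s|>) )
    (?= [^>]* \b color \s*=\s* "red"     (?:\s|>) )
    # Consume exactly three attributes, each one of the allowed forms.
    (?: (?: size\s*=\s*5 | color\s*=\s*"red" | face\s*=\s*"verdana" ) \s* ){3}
    >
    ([^<]+)       # text between the tags
    </font>
''', re.VERBOSE)

ATTRS = ['size=5', 'color="red"', 'face="verdana"']
GOOD = ['<font %s>randomtext</font>' % ' '.join(p) for p in permutations(ATTRS)]
BAD = [
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]


def extract(line):
    """Return the text inside the font tag, or None if the line is rejected."""
    m = FONT_RE.fullmatch(line.strip())
    return m.group(1) if m else None


if __name__ == "__main__":
    for line in GOOD + BAD:
        print(extract(line), line, sep="\t")
```

Because the alternation is repeated exactly three times and the lookaheads force all three attributes to be present, the repeated-size line and the lines with an extra attribute are all rejected.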
Answer 1 (score: 0)
You do realize that this is hard with a regex? If you have another option, use it!
For a regex, try the following:
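As one possible "other option", here is a minimal sketch using Python's standard html.parser module. The choice of Python and the names WANTED, FontTextExtractor and extract_font_texts are assumptions for illustration, not something the question asked for. The parser hands over attributes as (name, value) pairs, so "exactly these three, in any order" becomes a multiset comparison.

```python
from collections import Counter
from html.parser import HTMLParser

# The exact multiset of attributes a font tag must carry.
WANTED = Counter([("size", "5"), ("color", "red"), ("face", "verdana")])


class FontTextExtractor(HTMLParser):
    """Collect text that directly follows a font tag with exactly WANTED."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare attribute such as
        # 'garbagetext' arrives with value None, and duplicates are kept.
        # Any other start tag switches capturing off again.
        self._capturing = (tag == "font" and Counter(attrs) == WANTED)

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.texts.append(data)


def extract_font_texts(html):
    """Return the list of texts found inside qualifying font tags."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


if __name__ == "__main__":
    sample = (
        '<font color="red" size=5 face="verdana">randomtext</font>\n'
        '<font size=5 size=5 size=5>randomtext</font>\n'
        '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>\n'
    )
    print(extract_font_texts(sample))
```

Comparing Counter objects rather than sets means a repeated attribute or an extra one makes the tag fail, which is the "no more, no fewer" requirement.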
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")(?![^>]*(?<!color|size|face)=)(?:\s+[^>\s=]+=[^>\s=]+\s*)+>([^<]+)</font>
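I added/changed two things:
(?![^>]*(?<!color|size|face)=)
is a negative lookahead containing a nested negative lookbehind. It rejects the tag if any equals sign inside it is not immediately preceded by color, size or face.
I changed the [^>]* that matched the attributes to
(?:\s+[^>\s=]+=[^>\s=]+\s*)+
so that it only matches whitespace-separated name=value pairs whose name and value contain no whitespace, > or =. A bare word such as garbagetext therefore no longer matches.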
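One portability caveat: the alternatives inside (?<!color|size|face) have different lengths, and not every engine accepts that. Python's re module, for example, requires fixed-width lookbehind. The sketch below is an adaptation for Python, not the answer's exact pattern: it splits the lookbehind into three fixed-width ones, which asserts the same thing. The names FONT_RE and extract are my own.

```python
import re
from itertools import permutations

FONT_RE = re.compile(
    r'<font'
    r'(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    # Same meaning as (?<!color|size|face), written as fixed-width look-behinds.
    r'(?![^>]*(?<!color)(?<!size)(?<!face)=)'
    # Only whitespace-separated name=value pairs are allowed before '>'.
    r'(?:\s+[^>\s=]+=[^>\s=]+\s*)+'
    r'>([^<]+)</font>'
)

ATTRS = ['size=5', 'color="red"', 'face="verdana"']
GOOD = ['<font %s>randomtext</font>' % ' '.join(p) for p in permutations(ATTRS)]
BAD = [
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]


def extract(line):
    """Return the captured text, or None if the pattern does not match."""
    m = FONT_RE.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    for line in GOOD + BAD:
        print(extract(line), line, sep="\t")
```

This rejects all four unwanted sample lines. It does not count attributes, though: a tag that repeats an allowed name with a different value, such as an additional size=7 after the three required attributes, would still match.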