I have this function:
function remove_font_tags_without_attr($html)
{
    $pattern = "/<font[\s]*?>(.*?)<\/font[\s]*>/im";
    while (preg_match($pattern, $html)) {
        $html = preg_replace($pattern, "$1", $html);
    }
    return $html;
}
and this HTML input:
$html=
<p>
First: 0<font>1<font>2</font>3</font>4
Second: 0<font style="color:red">1<font>2</font>3</font>4
Third: 0<font>1<font style="color:green">2</font>3</font>4
Fourth: 0<font style="color:red">1<font style="color:green">2</font>3</font>4
</p>
I need to remove all font tags that have no attributes.
My function above returns:
<p>
First: 01234
Second: 0<font style="color:red">123</font>4
Third: 01<font style="color:green">23</font>4
Fourth: 0<font style="color:red">1<font style="color:green">2</font>3</font>4
</p>
But the problem is the third line (Third); the correct result for it is:
01<font style="color:green">2</font>34
The complete correct result:
<p>
First: 01234
Second: 0<font style="color:red">123</font>4
Third: 01<font style="color:green">2</font>34
Fourth: 0<font style="color:red">1<font style="color:green">2</font>3</font>4
</p>
Can you help me?
Answer 0 (score: 2)
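Using regular expressions to parse HTML (or any other non-regular language) is not recommended. There are many pitfalls and many ways for such a solution to fail. That said, I really like using regex to solve complex problems, such as those involving nested structures. If someone else provides a working non-regex solution, I recommend you use that one rather than the regex solution below.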
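For comparison, a non-regex approach of the kind mentioned above might look like the following minimal sketch. It assumes PHP's DOM extension is available; the function name `remove_font_tags_without_attr_dom` and the wrapper `<div>` are hypothetical choices made for this example, not part of the original question or answers.

```php
<?php
// Sketch only: strip attribute-less <font> tags using the DOM extension.
// The function name and the wrapper <div> are hypothetical, chosen for this example.
function remove_font_tags_without_attr_dom(string $html): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so it has one known root element, and silence the
    // warnings libxml may raise for obsolete or loosely written markup.
    $prev = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    // Copy the live node list into an array before mutating the tree.
    $fonts = iterator_to_array($doc->getElementsByTagName('font'));
    foreach ($fonts as $font) {
        if ($font->hasAttributes()) {
            continue; // keep FONT elements that carry attributes
        }
        // Move every child in front of the FONT element, then drop the empty tag.
        while ($font->firstChild !== null) {
            $font->parentNode->insertBefore($font->firstChild, $font);
        }
        $font->parentNode->removeChild($font);
    }

    // Serialize only the children of the wrapper (the first <div> in the document).
    $root = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_font_tags_without_attr_dom(
    '0<font>1<font style="color:green">2</font>3</font>4'
), "\n";
```

Because the parser rebuilds the markup, the output is not guaranteed to be byte-identical to the input outside the removed tags (attribute quoting, entities, and whitespace may be normalized), so it is worth comparing the result against your real data.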
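The solution below implements a recursive regex which is used with the preg_replace_callback() function (the callback calls itself recursively when the contents of a FONT element contain nested FONT elements). The regex matches an outermost FONT element (which may contain nested FONT elements). The callback function strips the start and end tags of only those FONT elements that have no attributes. FONT tags that have attributes are preserved. I think you'll find this does a pretty good job: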
<?php // test.php Rev:20111219_1100
// Recursive regex matches an outermost FONT element and its contents.
$re = '% # Match outermost FONT element.
< # Start of HTML start tag
( # $1: FONT element start tag.
font # Tag name = FONT
( # $2: FONT start tag attributes.
(?: # Group for zero or more attributes.
\s+ # Required whitespace precedes attrib.
[\w.\-:]+ # Attribute name.
(?: # Group for optional attribute value.
\s*=\s* # Name and value separated by =
(?: # Group for value alternatives.
\'[^\']*\' # Either single quoted,
| "[^"]*" # or double quoted,
| [\w.\-:]+ # or unquoted value.
) # End group of value alternatives.
)? # Attribute value is optional.
)* # Zero or more attributes.
) # End $2: FONT start tag attributes.
\s* # Optional whitespace before closing >.
> # End FONT element start tag.
) # End $1: FONT element start tag.
( # $3: FONT element contents.
(?: # Group for zero or more content alts.
(?R) # Either a nested FONT element.
| # or non-FONT tag stuff.
[^<]* # {normal*} Non-< start of tag stuff.
(?: # Begin "unrolling-the-loop".
< # {special} A "<", but only if it is
(?!/?font) # NOT start of a <font or </font
[^<]* # more {normal*} Non-< start of tag.
)* # End {(special normal*)*} construct.
)* # Zero or more content alternatives.
) # End $3: FONT element contents.
</font\s*> # FONT element end tag.
%xi';
// Remove matching start and end tags of FONT elements having no attributes.
function remove_font_tags_without_attr($text) {
    global $re;
    $text = preg_replace_callback($re,
        '_remove_font_tags_without_attr_cb', $text);
    // Restore the tags that the callback hid with a temporary null char.
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _remove_font_tags_without_attr_cb($matches) {
    global $re;
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
            '_remove_font_tags_without_attr_cb', $matches[3]);
    }
    if ($matches[2] == '') { // If this FONT tag has no attributes,
        return $matches[3]; // Then strip both start and end tag.
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/font>";
}
$data = file_get_contents('testdata.html');
$output = remove_font_tags_without_attr($data);
file_put_contents('testdata_out.html', $output);
?>
Example input:
<font attrib="value">
<font>
<font attrib="value">
<font>
<font attrib="value">
</font>
</font>
</font>
</font>
</font>
Example output:
<font attrib="value">
<font attrib="value">
<font attrib="value">
</font>
</font>
</font>
The complexity of the regex is needed to correctly handle tag attributes whose values may contain <> angle brackets.
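To illustrate the point about angle brackets inside attribute values, here is a small sketch comparing a naive start-tag pattern with the attribute sub-pattern condensed from the regex above. The sample tag and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample: the title value contains a ">" character.
$tag = '<font title="a>b" color=red>';

// Naive pattern: [^>]* stops at the first ">", which sits inside the quoted value.
preg_match('%<font[^>]*>%i', $tag, $naive);

// Attribute-aware pattern, condensed from the start-tag part of the regex above.
$attr_aware = '%<font(?:\s+[\w.\-:]+(?:\s*=\s*(?:\'[^\']*\'|"[^"]*"|[\w.\-:]+))?)*\s*>%i';
preg_match($attr_aware, $tag, $aware);

echo $naive[0], "\n"; // <font title="a>
echo $aware[0], "\n"; // <font title="a>b" color=red>
```

The naive pattern cuts the tag off in the middle of the quoted value, while the attribute-aware pattern consumes each quoted value as a unit and so finds the real end of the start tag.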
Answer 1 (score: 1)
Make it greedy:
$pattern = "/<font[\s]*?>(.*)<\/font[\s]*>/im";
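Greedy: the * (asterisk) repeats the preceding item zero or more times. It is greedy, so it matches as many items as possible before trying permutations with fewer matches of the preceding item, down to the point where the preceding item is not matched at all.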
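One way to see what the quantifier change does is to run both patterns on single lines from the question. The sketch below (variable names are mine) applies one replacement pass of each pattern to the second and third input lines. Under these assumptions the greedy form gives the desired result on the third line but a different result from the lazy form on the second, so it seems worth checking it against every line of your input.

```php
<?php
// The question's lazy pattern and the greedy variant suggested in this answer.
$lazy   = "/<font[\s]*?>(.*?)<\/font[\s]*>/im";
$greedy = "/<font[\s]*?>(.*)<\/font[\s]*>/im";

// Lines copied from the question's input.
$second = '0<font style="color:red">1<font>2</font>3</font>4';
$third  = '0<font>1<font style="color:green">2</font>3</font>4';

foreach (['second' => $second, 'third' => $third] as $name => $line) {
    echo $name, ' lazy:   ', preg_replace($lazy, '$1', $line), "\n";
    echo $name, ' greedy: ', preg_replace($greedy, '$1', $line), "\n";
}
// second lazy:   0<font style="color:red">123</font>4
// second greedy: 0<font style="color:red">12</font>34
// third lazy:   01<font style="color:green">23</font>4
// third greedy: 01<font style="color:green">2</font>34
```

The lazy pattern pairs a bare `<font>` with the nearest `</font>`, and the greedy one pairs it with the last `</font>` on the line; neither tracks nesting, which is what the recursive approaches in the other answers address.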
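Answer 2 (score: 0)
If PHP can't do this, then it can't. I'm going to try it, and if it works I'll post the PHP code. The Perl code is just a template for me to try it with.
Edit:
Removed the Perl code, added PHP code. The Ideone test case is at http://www.ideone.com/9b2Ap
Expanded regex -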
$regex = "~
$comment
| ( #1
(?:
$open
| ($openatt) #2
)
( #3
(?: $comment
| (?> (?:(?!$openclose|$comment) . )+ )
| (?1)
)*
)
($close) #4
)
~xs";
php -
<?php
//##
$html = '
<font>
<p>
First: _0<font>_1<font>_2</font>_3</font>_4
Second: _5<font style="color:red">_6<font>_7</font>_8</font>_9
Third: _10<font>_11<font style="color:green">_12</font>_13</font>_14
Fourth: _15<font style="color:red">_16<font style="color:green">_17</font>_18</font>_19
</p>
</font>
';
//##
$comment = '<!--.*?-->';
$open = '<font\s*>';
$openatt = '<font \s+(?:".*?"|\'.*?\'|[^>]*?)+ (?<!/)>';
$close = '</font\s*>';
$openclose = '</?font (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? (?<!/)>';
$regex = "~
$comment
| ( #1
(?: $open | ($openatt) ) #2
( (?:$comment | (?>(?:(?!$openclose|$comment).)+) | (?1))* ) #3
($close) #4
)
~xs";
//##
print "Before:\n$html\n\n";
$html = remove_font_tags_without_attr( $html );
print "After:\n$html\n";
exit;
//##
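function remove_font_tags_without_attr( $html_seg )
{
    global $regex;
    return preg_replace_callback( $regex, 'check_attr_cb', $html_seg );
}
function check_attr_cb( $matches )
{
    // A comment matched: group 1 is absent or empty, so return the match unchanged.
    if (!isset($matches[1]) || $matches[1] == '')
        return $matches[0];
    $begin = $matches[2]; // start tag with attributes, or '' for a bare <font>
    $core = $matches[3];  // element contents
    $end = $matches[4];   // end tag
    if ($begin == '')
        $end = '';        // no attributes: drop the end tag as well
    return $begin . (remove_font_tags_without_attr( $core )) . $end;
}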
?>