Question

I have this function:

function remove_font_tags_without_attr($html)
{  
  $pattern = "/<font[\s]*?>(.*?)<\/font[\s]*>/im";    
  while(preg_match($pattern, $html)) {
    $html = preg_replace($pattern, "$1", $html);
  } 
  return $html;  
}

and this HTML input:

$html=
<p>
First: 0<font>1<font>2</font>3</font>4
Second: 0<font style="color:red">1<font>2</font>3</font>4
Third: 0<font>1<font style="color:green">2</font>3</font>4
Fourth: 0<font style="color:red">1<font style="color:green">2</font>3</font>4
</p>

I need to remove all font tags that have no attributes.

My function above returns:

<p>
First: 01234
Second: 0<font style="color:red">123</font>4
Third: 01<font style="color:green">23</font>4
Fourth: 0<font style="color:red">1<font style="color:green">2</font>3</font>4
</p>

The problem is the third line (Third). The correct result there is:

01<font style="color:green">2</font>34

The complete correct result:

<p>
First: 01234
Second: 0<font style="color:red">123</font>4
Third: 01<font style="color:green">2</font>34
Fourth: 0<font style="color:red">1<font style="color:green">2</font>3</font>4
</p>

Can you help me?

Answer 1

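Disclaimer: don't use regex!

Using regex to parse HTML (or any other non-regular language) is not recommended. There are many pitfalls and many ways for a solution to fail. That said, I really like using regex to solve complex problems, such as those involving nested structures. If someone else provides a working non-regex solution, I recommend you use that one rather than the following.

A regex solution:

The following solution implements a recursive regex that is used together with the preg_replace_callback() function (the callback calls itself recursively when the contents of a FONT element contain nested FONT elements). The regex matches an outermost FONT element (which may contain nested FONT elements). The callback strips the start and end tags only of those FONT elements that have no attributes. FONT tags that have attributes are preserved. I think you'll find this does a pretty good job: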
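For comparison, a non-regex approach might look roughly like the sketch below. It is not part of the original answer: the function name remove_font_tags_without_attr_dom and the wrapper div are assumptions made for illustration, and it relies on PHP's bundled DOM extension (DOMDocument, DOMXPath) being enabled.

```php
<?php
// Hypothetical helper (not from the thread): unwrap <font> elements that
// carry no attributes by editing a parsed DOM tree instead of the raw text.
function remove_font_tags_without_attr_dom(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Assumption: the input is a fragment. It is wrapped in a <div> so the
    // fragment can be serialized back without the <html>/<body> scaffolding.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Snapshot the node list first so the tree can be modified while looping.
    $bare = iterator_to_array($xpath->query('//font[not(@*)]'));
    foreach ($bare as $font) {
        // Move every child in front of the element, then drop the empty shell.
        while ($font->firstChild !== null) {
            $font->parentNode->insertBefore($font->firstChild, $font);
        }
        $font->parentNode->removeChild($font);
    }

    // The first <div> in document order is the wrapper added above.
    $wrap = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_font_tags_without_attr_dom(
    '0<font>1<font style="color:green">2</font>3</font>4'
), "\n";
```

Because the parser builds a tree, nesting is handled structurally: the green element is kept and the attribute-less wrappers around it are removed without any pattern having to pair up opening and closing tags.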

function remove_font_tags_without_attr($text)

<?php // test.php Rev:20111219_1100
// Recursive regex matches an outermost FONT element and its contents.
$re = '% # Match outermost FONT element.
    <                     # Start of HTML start tag
    (                     # $1: FONT element start tag.
      font                # Tag name = FONT
      (                   # $2: FONT start tag attributes.
        (?:               # Group for zero or more attributes.
          \s+             # Required whitespace precedes attrib.
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more attributes.
      )                   # End $2: FONT start tag attributes.
      \s*                 # Optional whitespace before closing >.
      >                   # End FONT element start tag.
    )                     # End $1: FONT element start tag.
    (                     # $3: FONT element contents.
      (?:                 # Group for zero or more content alts.
        (?R)              # Either a nested FONT element.
      |                   # or non-FONT tag stuff.
        [^<]*             # {normal*} Non-< start of tag stuff.
        (?:               # Begin "unrolling-the-loop".
          <               # {special} A "<", but only if it is
          (?!/?font)      # NOT start of a <font or </font
          [^<]*           # more {normal*} Non-< start of tag.
        )*                # End {(special normal*)*} construct.
      )*                  # Zero or more content alternatives.
    )                     # End $3: FONT element contents.
    </font\s*>            # FONT element end tag.
    %xi';

// Remove matching start and end tags of FONT elements having no attributes.
function remove_font_tags_without_attr($text) {
    global $re;
    $text = preg_replace_callback($re,
            '_remove_font_tags_without_attr_cb', $text);
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _remove_font_tags_without_attr_cb($matches) {
    global $re;
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
            '_remove_font_tags_without_attr_cb', $matches[3]);
    }
    if ($matches[2] == '') {    // If this FONT tag has no attributes,
        return $matches[3];     // Then strip both start and end tag.
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/font>";
}
$data = file_get_contents('testdata.html');
$output = remove_font_tags_without_attr($data);
file_put_contents('testdata_out.html', $output);
?>
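The (?R) construct is what lets the pattern match nested elements: it re-enters the whole pattern at that point. A much smaller illustration of the same mechanism, using balanced parentheses instead of FONT tags (the pattern and variable names here are made up for the example):

```php
<?php
// Match one parenthesised group, including any groups nested inside it.
// (?R) recurses into the entire pattern wherever a nested "(" begins.
$balanced = '/\((?:[^()]++|(?R))*\)/';

preg_match($balanced, 'f(a(b)c)d', $m);
echo $m[0], "\n"; // (a(b)c)
```

The FONT pattern follows the same shape: the start tag plays the role of "(", the end tag the role of ")", and the content alternation is either a recursive match or text that does not start another FONT tag.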

Example input:

<font attrib="value">
    <font>
        <font attrib="value">
            <font>
                <font attrib="value">
                </font>
            </font>
        </font>
    </font>
</font>

Example output:

<font attrib="value">

        <font attrib="value">

                <font attrib="value">
                </font>

        </font>

</font>

The complexity of the regex is necessary to correctly handle tag attributes whose values may contain <> angle brackets.
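To see why the quoted-value handling matters, here is a small sketch comparing a naive start-tag pattern with a trimmed-down copy of the attribute group from the regex above. The sample tag and the variable names are invented for the illustration.

```php
<?php
$tag = '<font title="a > b" color="red">text</font>';

// Naive pattern: stops at the first ">", even though it sits inside a
// quoted attribute value.
preg_match('/<font[^>]*>/i', $tag, $naive);

// Quote-aware pattern: each attribute value is consumed as a whole, so a
// ">" inside quotes cannot end the tag early.
$quoted = '/<font(?:\s+[\w.\-:]+(?:\s*=\s*(?:\'[^\']*\'|"[^"]*"|[\w.\-:]+))?)*\s*>/i';
preg_match($quoted, $tag, $aware);

echo $naive[0], "\n"; // <font title="a >
echo $aware[0], "\n"; // <font title="a > b" color="red">
```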

Answer 2

Make it greedy:

$pattern = "/<font[\s]*?>(.*)<\/font[\s]*>/im";

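Greedy: * (star) repeats the preceding item zero or more times. It is greedy, so it matches as many items as possible first, and only then tries permutations with fewer matches of the preceding item, down to the point where the preceding item is not matched at all.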
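One way to check the greedy variant is to run it on the question's lines one at a time. In the sketch below, strip_greedy is a hypothetical wrapper around the question's loop with the greedy pattern substituted. The third line comes out as requested, but on the second line the match starts at the inner attribute-less tag and runs to the last closing tag, so the 3 ends up outside the red element.

```php
<?php
// Hypothetical helper: the question's loop, using the greedy pattern.
function strip_greedy(string $html): string
{
    $pattern = '/<font[\s]*?>(.*)<\/font[\s]*>/im';
    while (preg_match($pattern, $html)) {
        $html = preg_replace($pattern, '$1', $html);
    }
    return $html;
}

$lines = [
    '0<font>1<font>2</font>3</font>4',
    '0<font style="color:red">1<font>2</font>3</font>4',
    '0<font>1<font style="color:green">2</font>3</font>4',
];
foreach ($lines as $line) {
    echo strip_greedy($line), "\n";
}
// 01234
// 0<font style="color:red">12</font>34
// 01<font style="color:green">2</font>34
```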

Answer 3

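If PHP can't do this, then it can't. I'm going to give it a try, and if it works I'll post back the PHP code. The Perl code is just the template I tried it with.

Edit
Removed the Perl code, added the PHP code. Ideone test case at http://www.ideone.com/9b2Ap

Expanded regex -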

$regex = "~
    $comment
  |
    (                                             #1
      (?:
          $open
        | ($openatt)                              #2
      )
      (                                           #3
        (?:
            $comment
          | (?> (?:(?!$openclose|$comment) . )+ )
          | (?1)
        )*
      )
      ($close)                                    #4
    )
  ~xs";

PHP -

<?php
//##
$html = ' First: _0_1_2_3_4 Second: _5_6_7_8_9 Third: _10_11_12_13_14 Fourth: _15_16_17_18_19 ';

//##
$comment   = '<!--.*?-->';  // assumed value: matches an HTML comment
$open      = '<font\s*>';
$openatt   = '<font \s+(?:".*?"|\'.*?\'|[^>]*?)+ (?<!/)>';  // assumed head, mirrors $openclose
$close     = '</font\s*>';
$openclose = '</?font (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? (?<!/)>';

$regex = "~
    $comment
  |
    (                                             #1
      (?: $open | ($openatt) )                    #2
      (                                           #3
        (?: $comment | (?>(?:(?!$openclose|$comment).)+) | (?1) )*
      )
      ($close)                                    #4
    )
  ~xs";

//##
print "Before:\n$html\n\n";
$html = remove_font_tags_without_attr( $html );
print "After:\n$html\n";
exit;

//##
function remove_font_tags_without_attr( $html_seg )
{
    global $regex;
    return preg_replace_callback( $regex, 'check_attr_cb', $html_seg );
}

function check_attr_cb( $matches )
{
    // A comment match leaves group 1 unset: return it unchanged.
    if (!isset($matches[1]) || $matches[1] == '')
        return $matches[0];
    $begin = $matches[2];   // start tag, empty if it has no attributes
    $core  = $matches[3];   // element contents, processed recursively
    $end   = $matches[4];   // end tag
    if ($begin == '')
        $end = '';          // no attributes: drop the end tag as well
    return $begin . (remove_font_tags_without_attr( $core )) . $end;
}
?>

Remove HTML tags without attributes (PHP)