Question

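I am trying to clean up excess HTML based on CSS class. I don't want to remove every tag of a given type, only specific ones, and I want to keep the content inside them. I have been trying variations of: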

$content = preg_replace(
    '#(<div class\=\"removethis\">(^.*)</div>)#is', 
    '', 
    $content
);

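I realize the code above doesn't work, but hopefully it helps explain what I'm trying to do. I'm not familiar with regular expressions, so I haven't found anything useful so far.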

Answer 1

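The ^ is probably the mistake there. It looks for the start of the subject, not even the start of a line, and that can never occur at this position.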

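You are also replacing the match with '' rather than with the captured contents, e.g. '$1'. (In the pattern as written, the outer parentheses make $1 the whole match, so the inner content would be $2; drop the outer group and the inner content becomes $1.)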
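The two fixes above (no ^ inside the group, replace with the captured content) can be sketched as follows. This is a minimal illustration, not a general solution: the helper name unwrapSimpleDiv is hypothetical, and it assumes the class attribute is written exactly as class="removethis" and that the div contains no nested div elements.

```php
<?php
// Hypothetical helper: unwrap non-nested <div class="removethis"> elements.
function unwrapSimpleDiv($content)
{
    return preg_replace(
        '#<div class="removethis">(.*?)</div>#is', // lazy match, no stray ^
        '$1',                                      // keep the inner content
        $content
    );
}

echo unwrapSimpleDiv(
    '<p>Intro</p><div class="removethis">Keep <b>me</b></div><p>End</p>'
);
```

With a nested div inside the target, the lazy (.*?) would stop at the first </div> and leave an unbalanced closing tag behind, which is why the other answers reach for heavier tools.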

An off-topic answer to the question: you can also use querypath or another library to manage HTML content. The replacement then becomes simpler:

  htmlqp($html)->remove("div.removethis")->...()->writeHTML();

This is usually not suited to output transformation, but in other cases it is easier and more useful.
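As one possible "other library", PHP's bundled DOM extension can unwrap elements without any regex. The sketch below is not from the answer: the function name unwrapDivsByClass is hypothetical, it assumes the class name contains no quote characters, and it assumes plain ASCII input (encoding handling is left out for brevity).

```php
<?php
// Hypothetical helper: unwrap every <div> whose class list contains $class,
// keeping its children in place.
function unwrapDivsByClass($html, $class)
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    // loadHTML() puts the fragment inside <html><body>; the body's children
    // are serialized again at the end.
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Whole-token match on the class attribute, so "dontremovethisone"
    // does not count as "removethis".
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        // Move each child in front of the div, then drop the empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapDivsByClass(
    '<div class="keep"><div class="do removethis one">Hi <b>there</b></div></div>',
    'removethis'
);
```

The serializer may normalize whitespace or quoting, so the output should be treated as equivalent markup rather than a byte-for-byte copy of the input.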

Answer 2

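Disclaimer: Do Not Use Regex!

Using regex to parse HTML (or any other non-regular language) is not recommended. There are many pitfalls and many ways for such a solution to fail. That said, I really enjoy using regex to solve complex problems such as those involving nested structures. If someone else provides a working non-regex solution, I recommend that you use that one rather than the following.

A Regex Solution:

The following solution implements a recursive regex, used together with the preg_replace_callback() function (the callback calls itself recursively when the contents of a DIV element contain nested DIV elements). The regex matches an outermost DIV element, which may contain nested DIV elements. The callback function strips the start and end tags of only those DIV elements whose class attribute contains: removethis. DIV tags that do not have the removethis class are preserved. (The removethis value is stored in a variable at the top of the working script below and can easily be changed to suit.) I think you will find that this does a pretty good job: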

function stripSpecialDivTags($text)

<?php // test.php Rev:20111219_1600
// Remove DIV start and end tags having this class attribute:
$class_to_remove = "removethis";
// Recursive regex matches an outermost DIV element and its contents.
$re = '% # Match outermost DIV element.
    <                     # Start of HTML start tag
    (                     # $1: DIV element start tag.
      div                 # Tag name = DIV
      (                   # $2: DIV start tag attributes.
        (?:               # Group for zero or more attributes.
          \s+             # Required whitespace precedes attrib.
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more attributes.
      )                   # End $2: DIV start tag attributes.
      \s*                 # Optional whitespace before closing >.
      >                   # End DIV element start tag.
    )                     # End $1: DIV element start tag.
    (                     # $3: DIV element contents.
      (?:                 # Group for zero or more content alts.
        (?R)              # Either a nested DIV element.
      |                   # or non-DIV tag stuff.
        [^<]*             # {normal*} Non-< start of tag stuff.
        (?:               # Begin "unrolling-the-loop".
          <               # {special} A "<", but only if it is
          (?!/?div\b)     # NOT start of a <div or </div
          [^<]*           # more {normal*} Non-< start of tag.
        )*                # End {(special normal*)*} construct.
      )*                  # Zero or more content alternatives.
    )                     # End $3: DIV element contents.
    </div\s*>             # DIV element end tag.
    %xi';

// Remove matching start and end tags of DIV elements having specific class.
function stripSpecialDivTags($text) {
    global $re;
    $text = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $text);
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _stripSpecialDivTags_cb($matches) {
    global $re, $class_to_remove;
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $matches[3]);
    }
    // Regex to match class attribute and capture value in $1.
    $re_class = '/ ^      # Anchor to start of attributes string.
        (?:               # Zero or more non-class attributes.
          \s+             # Required whitespace precedes attrib.
          (?!class\b)     # Match any attribute other than "CLASS".
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =.
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more non-class attributes.
        \s+               # Required whitespace precedes attrib.
        class\s*=\s*      # "CLASS" is the attribute we need.
        (?|               # Use branch reset to capture value in $1.
          \'([^\']*)\'    # Either $1.1: a single quoted,
        | "([^"]*)"       # or $1.2: a double quoted,
        | ([\w.\-:]+)     # or $1.3: an un-quoted value.
        )                 # End branch reset to capture value in $1.
        /ix';
    $re_remove = '%(?<=^|\s)'.preg_quote($class_to_remove, '%').'(?=\s|$)%';
    if (preg_match($re_class, $matches[2], $m)) {// If DIV has a CLASS,
        if (preg_match($re_remove, $m[1])) { // AND it has special value,
            return $matches[3];     // Then strip start and end DIV tags.
        }
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/div>";
}
$data = file_get_contents('testdata.html');
$output = stripSpecialDivTags($data);
file_put_contents('testdata_out.html', $output);
?>

Example input:

<div class="do not remove">
    <div class=removethis>
        <div>
            <div class='do removethis one too'>
                <div class="dontremovethisone">
                </div>
            </div>
        </div>
    </div>
</div>

Example output:

<div class="do not remove">

        <div>

                <div class="dontremovethisone">
                </div>

        </div>

</div>

The complexity of the regex is necessary to correctly handle tag attributes whose values may contain <> angle brackets.
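To see why the attribute handling matters, here is a small sketch comparing a naive start-tag pattern with a condensed copy of the quote-aware attribute grammar used in the script above. The variable names and the sample tag are made up for illustration.

```php
<?php
// A start tag whose title value contains a ">" character.
$tag = '<div title="a > b" class="removethis">';

// Naive pattern: stops at the first ">", which is inside the title value.
preg_match('/<div[^>]*>/', $tag, $naive);

// Quote-aware pattern: each attribute value is single quoted, double quoted,
// or unquoted, so a ">" inside quotes cannot end the tag early.
$attrs = '(?:\s+[\w.\-:]+(?:\s*=\s*(?:\'[^\']*\'|"[^"]*"|[\w.\-:]+))?)*';
preg_match('/<div' . $attrs . '\s*>/', $tag, $aware);

echo $naive[0], "\n"; // truncated tag
echo $aware[0], "\n"; // complete tag
```

The naive match ends right after `a >`, so anything built on it would cut the tag in half, while the quote-aware match covers the whole start tag.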

Answer 3

Don't parse HTML with regex. You should use strip_tags:

$html = '<div class="foo">Hello world. <b>I am bold!</b></div>';

$allowed_tags = "<b>";

$text = strip_tags($html, $allowed_tags);

echo $text; #=> Hello world. <b>I am bold!</b>
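One caveat worth checking against the original question: strip_tags filters by tag name only and has no notion of class attributes. The sketch below, with made-up sample markup, shows that a div the asker would want to keep is stripped along with the one marked removethis.

```php
<?php
$html = '<div class="keep"><div class="removethis">Hello <b>world</b></div></div>';

// Every <div> is stripped, whatever its class; only <b> is allowed to stay.
$text = strip_tags($html, '<b>');

echo $text; // Hello <b>world</b>
```

So strip_tags fits when whole tag types can go, but not when only the elements with one particular class should be unwrapped.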

Using preg_replace to remove specific HTML tags without removing the content
