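I'm trying to clean up excess HTML based on CSS class. I don't want to remove every tag of a given type, only specific ones, and I want to keep the content inside them. I've been trying variations of: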
$content = preg_replace(
'#(<div class\=\"removethis\">(^.*)</div>)#is',
'',
$content
);
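I realize the code above doesn't work, but hopefully it helps explain what I'm trying to do. I'm not familiar with regular expressions, so I haven't found anything useful so far.
Answer 0 (score: 3)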
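The ^ inside (^.*) is probably the mistake there. It looks for the start of the subject, not even the start of a line, and that can never occur at this position.
You are also replacing the match with '' rather than with the captured content. In your pattern the outer parentheses form group 1 (the whole element) and the inner (.*) is group 2, so either replace with '$2', or drop the outer parentheses and replace with '$1'.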
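A minimal sketch of the corrected replacement, assuming the target divs are not nested and the attribute is written exactly as class="removethis". The sample markup is made up for illustration:

```php
<?php
// Hypothetical sample input: one div to unwrap, surrounded by other markup.
$content = '<p>intro</p><div class="removethis">keep <b>me</b></div><p>outro</p>';

$result = preg_replace(
    // No ^ anchor, no outer group; .*? is lazy so it stops at the first </div>.
    '#<div class="removethis">(.*?)</div>#is',
    '$1', // put the captured inner content back
    $content
);

echo $result, "\n"; // <p>intro</p>keep <b>me</b><p>outro</p>
```

Because the lazy .*? stops at the first </div>, this breaks as soon as the div contains another div.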
An off-topic answer: you can also use querypath or another library to manage the HTML content. The replacement then becomes simpler:
htmlqp($html)->remove("div.removethis")->...()->writeHTML();
This is usually not appropriate for transforming output, but in other situations it is easier and more useful.
Answer 1 (score: 2)
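As a rough sketch of the same library-based idea, here is a version that swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of querypath. The function name unwrapDivsByClass and the wrapper id unwrap-root are made up for this example, and it assumes a simple class name with no quote characters. Note that the DOM serializer may normalize whitespace and markup.

```php
<?php
// Hypothetical helper: remove the tags of every div carrying $class,
// keeping the children in place.
function unwrapDivsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings on imperfect HTML
    // Wrap the fragment so it can be located again after parsing.
    $doc->loadHTML('<div id="unwrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole whitespace-separated token.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        // Move every child in front of the div, then drop the empty div.
        while ($div->firstChild) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only the children of the wrapper.
    $root = $xpath->query('//div[@id="unwrap-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapDivsByClass(
    '<div class="keep"><div class="a removethis">Hello <b>world</b></div></div>',
    'removethis'
), "\n";
```

Working on a parsed tree means nested divs and unusual attribute quoting need no special handling.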
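Using regular expressions to parse HTML (or any other non-regular language) is not recommended. There are many pitfalls and many ways for such a solution to fail. That said, I quite enjoy using regex to solve complex problems, such as those involving nested structures. If someone else provides a working non-regex solution, I recommend you use that one rather than the following.
The following solution implements a recursive regex that is used with the preg_replace_callback() function (the callback calls itself recursively when the content of a DIV element contains nested DIV elements). The regex matches an outermost DIV element (which may contain nested DIV elements). The callback function strips the start and end tags of only those DIV elements whose class attribute contains removethis. DIV tags without the removethis class are preserved. (The removethis value is stored in a variable at the top of the working script below and can easily be changed to suit.) I think you'll find this does a pretty good job: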
<?php // test.php Rev:20111219_1600
// Remove DIV start and end tags having this class attribute:
$class_to_remove = "removethis";
// Recursive regex matches an outermost DIV element and its contents.
$re = '% # Match outermost DIV element.
< # Start of HTML start tag
( # $1: DIV element start tag.
div # Tag name = DIV
( # $2: DIV start tag attributes.
(?: # Group for zero or more attributes.
\s+ # Required whitespace precedes attrib.
[\w.\-:]+ # Attribute name.
(?: # Group for optional attribute value.
\s*=\s* # Name and value separated by =
(?: # Group for value alternatives.
\'[^\']*\' # Either single quoted,
| "[^"]*" # or double quoted,
| [\w.\-:]+ # or unquoted value.
) # End group of value alternatives.
)? # Attribute value is optional.
)* # Zero or more attributes.
) # End $2: DIV start tag attributes.
\s* # Optional whitespace before closing >.
> # End DIV element start tag.
) # End $1: DIV element start tag.
( # $3: DIV element contents.
(?: # Group for zero or more content alts.
(?R) # Either a nested DIV element.
| # or non-DIV tag stuff.
[^<]* # {normal*} Non-< start of tag stuff.
(?: # Begin "unrolling-the-loop".
< # {special} A "<", but only if it is
(?!/?div) # NOT start of a <div or </div
[^<]* # more {normal*} Non-< start of tag.
)* # End {(special normal*)*} construct.
)* # Zero or more content alternatives.
) # End $3: DIV element contents.
</div\s*> # DIV element end tag.
%xi';
// Remove matching start and end tags of DIV elements having specific class.
function stripSpecialDivTags($text) {
    global $re;
    $text = preg_replace_callback($re,
        '_stripSpecialDivTags_cb', $text);
    // Restore the tags that the callback hid with a temporary null char.
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _stripSpecialDivTags_cb($matches) {
    global $re, $class_to_remove;
    // Recurse into the contents if they hold nested DIV elements.
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $matches[3]);
    }
    // Regex to match class attribute and capture value in $1.
    $re_class = '/ ^              # Anchor to start of attributes string.
        (?:                       # Zero or more non-class attributes.
          \s+                     # Required whitespace precedes attrib.
          (?!class\b)             # Match any attribute other than "CLASS".
          [\w.\-:]+               # Attribute name.
          (?:                     # Group for optional attribute value.
            \s*=\s*               # Name and value separated by =.
            (?:                   # Group for value alternatives.
              \'[^\']*\'          # Either single quoted,
            | "[^"]*"             # or double quoted,
            | [\w.\-:]+           # or unquoted value.
            )                     # End group of value alternatives.
          )?                      # Attribute value is optional.
        )*                        # Zero or more non-class attributes.
        \s+                       # Required whitespace precedes attrib.
        class\s*=\s*              # "CLASS" is the attribute we need.
        (?|                       # Use branch reset to capture value in $1.
          \'([^\']*)\'            # Either $1.1: a single quoted,
        | "([^"]*)"               # or $1.2: a double quoted,
        | ([\w.\-:]+)             # or $1.3: an un-quoted value.
        )                         # End branch reset to capture value in $1.
        /ix';
    $re_remove = '%(?<=^|\s)'.preg_quote($class_to_remove, '%').'(?=\s|$)%';
    if (preg_match($re_class, $matches[2], $m)) { // If DIV has a CLASS,
        if (preg_match($re_remove, $m[1])) {      // AND it has special value,
            return $matches[3];                   // Then strip start and end DIV tags.
        }
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/div>";
}
$data = file_get_contents('testdata.html');
$output = stripSpecialDivTags($data);
file_put_contents('testdata_out.html', $output);
?>
Example input (testdata.html):
<div class="do not remove">
<div class=removethis>
<div>
<div class='do removethis one too'>
<div class="dontremovethisone">
</div>
</div>
</div>
</div>
</div>
Example output (testdata_out.html):
<div class="do not remove">
<div>
<div class="dontremovethisone">
</div>
</div>
</div>
The complexity of the regex is needed to correctly handle tag attributes whose values may contain <> angle brackets.
Answer 2 (score: 1)
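To see the recursion in isolation, here is a much-reduced sketch of the same idea. The pattern is a simplification written for illustration, not the regex from the script above: it assumes attribute values never contain a > character, and it only finds outermost DIV elements without rewriting anything.

```php
<?php
// Simplified recursive pattern (an assumption-laden sketch):
//   <div ...>  then any mix of plain text, a "<" that does not start a
//   div tag, or a whole nested div via (?R), then the closing </div>.
$re_simple = '%<div\b[^>]*>(?:[^<]+|<(?!/?div\b)|(?R))*</div\s*>%i';

// Made-up sample: one div with a nested div, followed by a second div.
$sample = 'a <div id="1">x <div>y</div> z</div> b <div>w</div>';

$count = preg_match_all($re_simple, $sample, $found);

echo $count, "\n";       // 2
echo $found[0][0], "\n"; // <div id="1">x <div>y</div> z</div>
echo $found[0][1], "\n"; // <div>w</div>
```

The nested div is consumed by (?R) as part of its parent, which is why only two outermost elements are reported.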
Don't parse HTML with regular expressions. You should use strip_tags instead:
$html = '<div class="foo">Hello world. <b>I am bold!</b></div>';
$allowed_tags = "<b>";
$text = strip_tags($html, $allowed_tags);
echo $text; #=> Hello world. <b>I am bold!</b>
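Keep in mind that strip_tags() filters by tag name only and knows nothing about classes. A small sketch with made-up sample markup shows the two outcomes: either every div is stripped, or every div (including the removethis one) is kept.

```php
<?php
// Hypothetical sample: a div to keep wrapping a div to unwrap.
$html = '<div class="keep"><div class="removethis">Hi <b>there</b></div></div>';

// Allowing only <b> strips both divs, not just the removethis one.
$noDivs = strip_tags($html, '<b>');

// Allowing <div> as well keeps both divs, attributes included.
$allDivs = strip_tags($html, '<div><b>');

echo $noDivs, "\n";  // Hi <b>there</b>
echo $allDivs, "\n"; // <div class="keep"><div class="removethis">Hi <b>there</b></div></div>
```

So this approach fits when all tags of a type should go, rather than only those with a particular class.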