So I have this HTML:
<html>
<head>...</head>
<body>
(some js and css)
<div class="no_remove">(content)</div>
<div class="no_remove">(content that i didn't want to remove)
<div class="remove">
<span>(content)</span>
<span>(content)</span>
<span>(content)</span>
<div class="other1">(content)</div>
<div class="other2">(content)</div>
<div class="other3">(content)</div>
</div>
</div>
</body>
</html>
and this PHP:
$text = file_get_contents($link);
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="no_remove"]');
$result = $dom->saveXML($div->item(1));
$result2 = preg_replace('#<div class="remove">(.*?)</div>#', ' ', $result);
echo $result2;
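The DOM/XPath part does its job perfectly, but preg_replace does not remove the div with the "remove" class.
Could a regex master, or anyone else, shed some light on this?
Sorry for my bad English.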
Answer 0: (score: 0)
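You probably need the s modifier (PCRE_DOTALL), which lets the dot match newlines as well. It is often confused with the multiline modifier, but that one is m and only changes how ^ and $ behave.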
$result2 = preg_replace('#<div class="remove">(.*?)</div>#s', ' ', $result);
Alternatively, you can use [\s\S] instead of the dot (.) to match across lines. So:
$result2 = preg_replace('#<div class="remove">([\s\S]*?)</div>#', ' ', $result);
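To make the difference concrete, here is a minimal sketch comparing the original pattern with the two variants above. The $sample string is a made-up, simplified input (a multi-line div with no nested divs), not the asker's real page.

```php
<?php
// Hypothetical sample: the "remove" div spans several lines and has no nested divs.
$sample = "<p>keep</p>\n<div class=\"remove\">\n<span>a</span>\n</div>\n<p>also keep</p>";

// Original pattern: by default the dot does not match a newline, so nothing matches
// and preg_replace returns the input unchanged.
$plain = preg_replace('#<div class="remove">(.*?)</div>#', ' ', $sample);

// Variant 1: the s modifier lets the dot match newlines too.
$dotAll = preg_replace('#<div class="remove">(.*?)</div>#s', ' ', $sample);

// Variant 2: [\s\S] matches any character, newline included, without a modifier.
$charClass = preg_replace('#<div class="remove">([\s\S]*?)</div>#', ' ', $sample);

var_dump($plain === $sample);      // the original pattern changed nothing
var_dump($dotAll === $charClass);  // both variants give the same result
echo $dotAll, "\n";
```

Both variants behave the same on this input; which one to use is mostly a matter of taste.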
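Also, I would normally use \s+ rather than a literal space, in case the HTML contains more than one whitespace character there. So something like: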
$result2 = preg_replace('#<div\s+class="remove">([\s\S]*?)</div>#', ' ', $result);
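One caveat worth checking before relying on any of these patterns: the lazy quantifier stops at the first </div> it finds, and in the question's markup the "remove" div contains other divs. The sketch below uses a trimmed-down copy of that markup (my own shortening, not the full page) to show what is left behind.

```php
<?php
// Trimmed-down version of the question's markup: the "remove" div has nested divs.
$html = <<<HTML
<div class="no_remove">keep
<div class="remove">
<span>a</span>
<div class="other1">b</div>
<div class="other2">c</div>
</div>
</div>
HTML;

// The lazy match ends at the closing tag of "other1", not at the end of "remove".
$result = preg_replace('#<div\s+class="remove">([\s\S]*?)</div>#', ' ', $html);

echo $result, "\n";

// Count what survived: "other2" is still there and the tags no longer balance.
printf("opening divs: %d, closing divs: %d\n",
    substr_count($result, '<div'),
    substr_count($result, '</div>'));
```

A regex cannot count nesting levels in general, so for markup like this a DOM-based removal is the safer route.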
You could also try something like this to handle multiple attributes and either kind of quote:
$result2 = preg_replace('#<div\b[^>]+\bclass\s*=\s*[\'\"]remove[\'\"][^>]*>([\s\S]*?)</div>#', ' ', $result);
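*Quick edit: I added \b to mark word boundaries, so that an attribute whose name merely ends in "class" (subclass, for example) is not mistaken for the class attribute. Note that a hyphen is not a word character, so \b still matches right before "class" in data-class; to exclude that as well, require whitespace before the attribute name (\sclass) instead.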
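A quick way to see what \b does and does not rule out is to run the pattern against a few attribute names. In this sketch $withSpace is a hypothetical stricter variant of my own (whitespace required before class), and the three sample tags are made up for the test.

```php
<?php
// Pattern from this answer (word boundary before "class").
$withBoundary = '#<div\b[^>]+\bclass\s*=\s*[\'\"]remove[\'\"][^>]*>([\s\S]*?)</div>#';

// Hypothetical stricter variant: require whitespace right before "class".
$withSpace = '#<div\b[^>]*\sclass\s*=\s*[\'\"]remove[\'\"][^>]*>([\s\S]*?)</div>#';

// Made-up sample tags, keyed by the attribute being tested.
$samples = [
    'class'      => '<div id="a" class="remove">x</div>',
    'subclass'   => '<div subclass="remove">x</div>',
    'data-class' => '<div data-class="remove">x</div>',
];

$matches = [];
foreach ($samples as $name => $tag) {
    $matches[$name] = [
        'boundary' => preg_match($withBoundary, $tag),
        'space'    => preg_match($withSpace, $tag),
    ];
}

// 1 means the pattern matched the tag, 0 means it did not.
print_r($matches);
```

The word boundary rejects subclass but accepts data-class, because the position between "-" and "c" counts as a boundary; the whitespace variant rejects both.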
Answer 1: (score: 0)
Here is how to keep using the right tool: use DOMDocument / XPath to remove the unwanted div by its class name (don't resort to regex).
Code: (Demo)
$html = <<<HTML
<html>
<head>...</head>
<body>
(some js and css)
<div class="no_remove">(content)</div>
<div class="no_remove">(content that i didn't want to remove)
<div class="remove">
<span>(content)</span>
<span>(content)</span>
<span>(content)</span>
<div class="other1">(content)</div>
<div class="other2">(content)</div>
<div class="other3">(content)</div>
</div>
</div>
</body>
</html>
HTML;
libxml_use_internal_errors(true);
$dom=new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//div[@class="remove"]') as $div) {
$div->parentNode->removeChild($div);
}
echo $dom->saveHTML();
Output:
<html>
<head></head><p>...
</p><body>
(some js and css)
<div class="no_remove">(content)</div>
<div class="no_remove">(content that i didn't want to remove)
</div>
</body>
</html>
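If, as in the question, only the second "no_remove" div should be echoed, the same idea can be combined with item(1). This is a rough sketch under a few assumptions of mine: the markup is a shortened stand-in for the real page, the "remove" div may carry extra class names (hence the token-style XPath test), and saveHTML is used on a single node. A plain contains(@class, "remove") is avoided on purpose because it would also match "no_remove".

```php
<?php
// Shortened stand-in for the question's page; "extra" is an invented second class name.
$html = <<<HTML
<html>
<body>
<div class="no_remove">first</div>
<div class="no_remove">second
<div class="remove extra">
<span>gone</span>
<div class="other1">gone too</div>
</div>
</div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Token match: true only when "remove" is one of the space-separated class names.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " remove ")]';

// Copy the node list into an array first, then detach each matching div.
foreach (iterator_to_array($xpath->query($query)) as $div) {
    $div->parentNode->removeChild($div);
}

// As in the question, serialize only the second "no_remove" div.
$target = $xpath->query('//div[@class="no_remove"]')->item(1);
$result = $dom->saveHTML($target);

echo $result, "\n";
```

Because the whole subtree is detached as a node, nested divs inside "remove" go with it and the remaining markup stays balanced.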