PHP: remove a tag with a given class, along with its content

Date: 2018-05-18 02:35:13

Tags: php regex dom xpath

So I have this HTML:

<html>
<head>...</head>
<body>
(some js and css)
    <div class="no_remove">(content)</div>
    <div class="no_remove">(content that i didn't want to remove)
        <div class="remove">
            <span>(content)</span>
            <span>(content)</span>
            <span>(content)</span>
            <div class="other1">(content)</div>
            <div class="other2">(content)</div>
            <div class="other3">(content)</div>
        </div>
    </div>
</body>
</html>

and this PHP:

$text = file_get_contents($link);
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="no_remove"]');
$result = $dom->saveXML($div->item(1));
$result2 = preg_replace('#<div class="remove">(.*?)</div>#', ' ', $result);
echo $result2;

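DOM XPath does its job perfectly, but preg_replace does not remove the div with the "remove" class. Could a regex expert, or anyone else, shed some light on this?

Sorry for my bad English.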

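2 answers:

Answer 0: (score: 0)

You probably need the s modifier (PCRE_DOTALL), which lets . match newline characters as well. Without it, . stops at the first line break, so the pattern never reaches the closing tag: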

$result2 = preg_replace('#<div class="remove">(.*?)</div>#s', ' ', $result);

Alternatively, you can use [\s\S] instead of . to match across lines. So:

$result2 = preg_replace('#<div class="remove">([\s\S]*?)</div>#', ' ', $result);

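Also, I would normally use \s+ rather than a literal space, in case the HTML has more than one whitespace character there. Something like: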

$result2 = preg_replace('#<div\s+class="remove">([\s\S]*?)</div>#', ' ', $result);

You could also try something like this to handle multiple attributes and other kinds of quotes:

$result2 = preg_replace('#<div\b[^>]+\bclass\s*=\s*[\'\"]remove[\'\"][^>]*>([\s\S]*?)</div>#', ' ', $result);

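*Quick edit: I added \b to mark word boundaries, so that an attribute whose name merely ends in "class" (subclass, for example) is not matched in place of the class attribute. Note that \b also matches right after a hyphen, so data-class would still match.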
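
One caveat applies to all of the patterns above: the lazy quantifier stops at the first </div> it finds, and in the question's markup that is the closing tag of a nested div, not of the "remove" div itself. A minimal sketch of the effect, using a shortened fragment that is only assumed to resemble what saveXML returns in the question:

```php
<?php
// Fragment shaped like the question's second "no_remove" div (shortened).
$result = <<<HTML
<div class="no_remove">keep
    <div class="remove">
        <span>a</span>
        <div class="other1">b</div>
        <div class="other2">c</div>
    </div>
</div>
HTML;

// With the s modifier the pattern now matches across newlines...
$result2 = preg_replace('#<div class="remove">(.*?)</div>#s', ' ', $result);

// ...but the lazy .*? ends at the first </div>, which closes "other1",
// so "other2" and an unmatched </div> are left behind in the output.
echo $result2;
```

A regex cannot reliably pair nested opening and closing tags, which is why this only works when the "remove" div contains no inner div.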

Answer 1: (score: 0)

Here is how to keep using the right tool: use DOMDocument / XPath to remove the unwanted div by its class name (don't resort to regex).

Code:

$html = <<<HTML
<html>
<head>...</head>
<body>
(some js and css)
    <div class="no_remove">(content)</div>
    <div class="no_remove">(content that i didn't want to remove)
        <div class="remove">
            <span>(content)</span>
            <span>(content)</span>
            <span>(content)</span>
            <div class="other1">(content)</div>
            <div class="other2">(content)</div>
            <div class="other3">(content)</div>
        </div>
    </div>
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$dom=new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//div[@class="remove"]') as $div) {
    $div->parentNode->removeChild($div);
}
echo $dom->saveHTML();

Output:

<html>
<head></head><p>...
</p><body>
(some js and css)
    <div class="no_remove">(content)</div>
    <div class="no_remove">(content that i didn't want to remove)

    </div>
</body>
</html>
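
The query //div[@class="remove"] matches only when the class attribute is exactly "remove". If the div might carry several classes, a common XPath 1.0 idiom is to pad the class list with spaces and test for the padded token. The sketch below assumes a hypothetical input with an extra "box" class; it is an illustration, not part of the original answer:

```php
<?php
// Hypothetical input: the target div carries a second class ("box").
$html = '<div class="no_remove">keep<div class="box remove"><div class="other1">gone</div></div></div>';

libxml_use_internal_errors(true); // silence warnings on imperfect HTML
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so "remove" matches only as a whole token.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " remove ")]';
foreach ($xpath->query($query) as $div) {
    $div->parentNode->removeChild($div);
}

$output = $dom->saveHTML();
echo $output;
```

A plain contains(@class, "remove") would be too loose here, because it would also match the "no_remove" div; the space padding avoids that.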