PHP: Remove div tags from a string, except those with certain classes, but keep their content

Date: 2014-01-22 17:12:50

Tags: php regex domdocument

I am looking for a way in PHP to remove some tags from a string. I have this string:

hello
<div class="test-1 safe">Hi everybody</div>
<div>Hello world</div>
<p>Hi guys, this is a text</p>
<div class="test">this is another text</div>

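I am trying to write a method that removes all div tags from the string, except those that have the "safe" class, and that also removes the "safe" class from the divs it keeps. The content of the removed divs must stay. For this input, I want the following output: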

hello
<div class="test-1">Hi everybody</div>
Hello world
<p>Hi guys, this is a text</p>
this is another text

I started with a regex:

public static function clean_text($text, $parent = '')
{
    // Strips every opening and closing div tag, with no exception for "safe"
    $cleanText = preg_replace("/<\/?div[^>]*\>/i", "", $text);
    return $cleanText;
}

But it removes all the divs. I then moved to DOMDocument, but I still have problems (extra HTML such as a doctype gets inserted, and there are encoding issues).

public static function clean_text($text, $parent = '')
{
    //some unnecessary code before...
    $cleanText = $text;

    //parsing DOM
    $dom = new \DOMDocument();
    $dom->loadHTML($cleanText);

    $divs = $dom->getElementsByTagName('div');
    $i = $divs->length - 1;
    while ($i > -1) {
        $div = $divs->item($i);
        if ($div->hasAttribute('class') && strstr($div->getAttribute('class'), 'safe'))
        {
            $class = $div->getAttribute('class');
            $class = str_replace('safe','',$class);
            $div->removeAttribute('class');
            $div->setAttribute('class',$class);
        }
        else
        {
            $txt = $div->nodeValue;
            $newelement = $dom->createTextNode($txt);
            $div->parentNode->replaceChild($newelement, $div);
        }
        $i--;
    }

    $text = $dom->saveHTML();

    return $text;
}

Is there a simpler way?

Thank you very much for your help.

1 answer:

Answer 0: (score: 0)

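You can do this with a negative lookahead:

// $str holds the input string from the question
$pattern = array(

// unwrap divs whose class attribute does not end in " safe", keeping their content
'~<div(?![^>]*class="[^"]+ safe")[^>]*>(.*?)</div>~s',

// then remove " safe" from the divs that were kept
// (this expects "safe" to be the last of several classes, as in the example)
'~(<div[^>]+class="[^"]+) safe"~s');

$replace = array('\1', '\1"');

$str = preg_replace($pattern, $replace, $str);
echo "<pre>".htmlspecialchars($str)."</pre>";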

Output:

hello
<div class="test-1">Hi everybody</div>
Hello world
<p>Hi guys, this is a text</p>
this is another text
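
If the input can contain nested divs, or a "safe" class that is not the last one in the attribute, a DOM-based variant may hold up better than the regex. The following is only a minimal sketch under stated assumptions, not the method from the answer above. The function name strip_unsafe_divs and the wrapper div are hypothetical. The `<?xml encoding="UTF-8">` prefix is a commonly used hint so that loadHTML reads the input as UTF-8; how it behaves can depend on your libxml version. The fragment is wrapped in one root div and only that root's children are serialized, which should keep the doctype, html and body tags out of the result.

```php
<?php
// Hypothetical helper (name not from the original post): unwraps every div
// that lacks the "safe" class and strips "safe" from the divs that are kept.
function strip_unsafe_divs(string $html): string
{
    $dom = new DOMDocument();

    // Silence parser warnings about imperfect markup while loading.
    $previous = libxml_use_internal_errors(true);
    // Assumption: input is UTF-8. The wrapper div gives us a known root node.
    $dom->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first div in document order is our wrapper.
    $root = $dom->getElementsByTagName('div')->item(0);

    // Take a static snapshot: the live node list changes while we edit the tree.
    $divs = iterator_to_array($root->getElementsByTagName('div'), false);

    foreach ($divs as $div) {
        // Compare whole class tokens, so "unsafe" does not count as "safe".
        $classes = preg_split('/\s+/', trim($div->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);

        if (in_array('safe', $classes, true)) {
            $remaining = array_diff($classes, array('safe'));
            if (count($remaining) > 0) {
                $div->setAttribute('class', implode(' ', $remaining));
            } else {
                $div->removeAttribute('class');
            }
            continue;
        }

        // Unwrap: move the children (text and markup) in front of the div,
        // then drop the now-empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only the wrapper's children, not the whole document.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Calling strip_unsafe_divs($str) on the string from the question should leave the first div in place with class="test-1" and unwrap the other two divs. Unlike replacing a div with its nodeValue, moving the child nodes keeps any markup inside the removed divs.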