Question

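I have a question about parsing text and removing unwanted HTML parts. I know that functions like strip_tags() remove all tags, but the problem is that they leave the "inner text" behind.

Let me give an example. We have this text: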

Hello, how are you? <a href="">Link to my website</a> __Here continues html tags, links, images__

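What I want is to remove the whole part where the HTML is: not only the tags, but also the text inside them (like "Link to my website" above).

Is there an efficient way to do this, or a function I have missed?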

Answer 1

Try this:

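function removeTags($str) {
    $result = '';

    // loadHTML() is called on an instance; the static call fails on PHP 8.
    $doc = new DOMDocument();
    $doc->loadHTML(sprintf('<body>%s</body>', $str));

    // Keep only the text nodes that are direct children of <body>,
    // which drops every element together with its inner text.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body/text()') as $textNode) {
        $result .= $textNode->nodeValue;
    }

    return $result;
}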

echo removeTags(
    'Hello, how are you? <a href="">Link to my website</a> __Here continues html <span>tags</span>, links, images__'
);

Output:

Hello, how are you? __Here continues html , links, images__

Answer 2

Why not simply require that submitted input must not contain markup?

function containsIllegalHtml($input, $allowable_tags = '') {
    // If stripping tags changes the string, it contained disallowed markup.
    return $input !== strip_tags($input, $allowable_tags);
}

Use this function to check whether the input contains tags.
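As a rough usage sketch, the check could be called like this. The sample inputs and variable names are made up for illustration, and the function is repeated only so the snippet runs on its own.

```php
<?php
// Same check as above, repeated so this snippet is self-contained.
function containsIllegalHtml($input, $allowable_tags = '') {
    return $input !== strip_tags($input, $allowable_tags);
}

// Hypothetical inputs, for illustration only.
$plain  = 'Hello, how are you?';
$linked = 'Hello <a href="">Link to my website</a>';
$bold   = 'Hello <b>there</b>';

var_dump(containsIllegalHtml($plain));        // bool(false): no tags at all
var_dump(containsIllegalHtml($linked));       // bool(true): <a> is not allowed
var_dump(containsIllegalHtml($bold, '<b>'));  // bool(false): <b> is whitelisted
```

Note that this only detects markup so the input can be rejected; it does not remove anything.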

Answer 3

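You could write a function that takes a string and uses PHP's string functions to find the position of "<" and then the position of ">", and removes them from the input string.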
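A minimal sketch of that idea, assuming "remove them" means cutting out each span from "<" through the next ">". The helper name stripAngleSpans is hypothetical. As written it behaves much like strip_tags(): the tags go, but the text between them stays.

```php
<?php
// Hypothetical helper: removes every "<...>" span using only string functions.
function stripAngleSpans($str) {
    $offset = 0;
    while (($open = strpos($str, '<', $offset)) !== false) {
        $close = strpos($str, '>', $open);
        if ($close === false) {
            break; // unmatched "<": leave the rest of the string untouched
        }
        // Cut out everything from "<" through ">" inclusive.
        $str = substr($str, 0, $open) . substr($str, $close + 1);
        // Continue scanning from where the removed span started.
        $offset = $open;
    }
    return $str;
}

echo stripAngleSpans('Hello, how are you? <a href="">Link to my website</a> end');
// prints: Hello, how are you? Link to my website end
```

To drop the inner text as well, the function would have to pair each opening tag with its closing tag, which is what a DOM parser already does.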

Answer 4

Maybe this isn't correct, but...

$str = 'Hello, how are you? <a href="">Link to my website</a> __Here continues html tags, links, ';
$rez = preg_replace("/\<.*\>/i",'',$str);
var_dump($rez);

gave me this output:

string 'Hello, how are you?  __Here continues html tags, links, ' (length=56)
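One caveat worth checking before relying on this: `.*` is greedy, so the pattern matches from the first "<" to the last ">" on a line. The sketch below uses made-up sample strings to show that any plain text sitting between two tags is removed as well.

```php
<?php
// Same pattern as in the answer above.
$pattern = "/\<.*\>/i";

// Hypothetical sample strings, for illustration only.
$single = 'before <a href="">link</a> after';
$double = 'before <a href="">one</a> middle <b>two</b> after';

$one = preg_replace($pattern, '', $single);
$two = preg_replace($pattern, '', $double);

var_dump($one === 'before  after'); // bool(true): the link and its text are gone
var_dump($two === 'before  after'); // bool(true): "middle" was removed too
```

That is fine when all the markup sits in one contiguous run, as in the question's example, but it loses text when tags are interleaved with content that should be kept.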

Answer 5

I searched and found this solution:

$txt = "
<html>
<head><title>Something wicked this way comes</title></head>
<body>
This is the interesting stuff I want to extract
</body>
</html>";

$text = preg_replace("/<([^<>]*)>/", "", $txt);

echo htmlentities($text);

Answer 6

Some preg magic?

$text = preg_replace('/<[\/\!]*?[^<>]*?>/si', '', $text);

Answer 7

Maybe this will work:

http://htmlpurifier.org/

Here is a tutorial:

http://www.zendcasts.com/writing-custom-zend-filters-with-htmlpurifier/2011/06/

It is written for Zend Framework, but I think it might help.

How to remove the HTML part of a text in PHP
