Question

I have a string like the following:

<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>

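I want to extract the text from the HTML above as Hello World, this is StackOverflow's question details page, and I also want to remove the &nbsp;.

How can we do this in PHP? I tried a few functions (strip_tags, html_entity_decode, and so on), but each of them fails in some cases.

Please help, thanks!

Edited: the code I am trying is shown below, but it does not work :( It leaves entities such as &nbsp; and &#39; in the output.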

$TMP_DESCR = trim(strip_tags($rs['description']));

Answer 1

The following worked for me. I had to do a str_replace on the non-breaking space.

$string = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";
echo htmlspecialchars_decode(trim(strip_tags(str_replace('&nbsp;', '', $string))), ENT_QUOTES);

Answer 2

strip_tags() will remove the tags, and trim() should remove the whitespace. I am not sure whether it works on non-breaking spaces, though.
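
To settle the doubt about non-breaking spaces: by default trim() strips only ASCII whitespace and the NUL byte, so a decoded non-breaking space (U+00A0, the bytes 0xC2 0xA0 in UTF-8) survives it. Below is a minimal sketch, assuming UTF-8 strings and PHP 7 or later for the "\u{00A0}" escape. The helper name trimWithNbsp is made up for illustration.

```php
<?php
// Hypothetical helper: trim ordinary whitespace and U+00A0 from both ends.
// Assumes $text is valid UTF-8; if the pattern fails, the input is returned unchanged.
function trimWithNbsp(string $text): string
{
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text) ?? $text;
}

// What a leading &nbsp; looks like once it has been decoded.
$decoded = "\u{00A0}Hello World ";

var_dump(trim($decoded) === "\u{00A0}Hello World"); // plain trim() keeps the NBSP
var_dump(trimWithNbsp($decoded) === "Hello World"); // the helper removes it
```

The same effect can be had by passing a custom character list to trim(), but the regular expression avoids treating the two UTF-8 bytes of U+00A0 as independent characters.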

Answer 3

First, you have to call trim() on the HTML to remove the whitespace. http://php.net/manual/en/function.trim.php

Then strip_tags, then html_entity_decode.

So: html_entity_decode(strip_tags(trim($html)));
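
Two details may explain why this chain appeared to fail for the asker. Before PHP 8.1 the default flags of html_entity_decode() were ENT_COMPAT | ENT_HTML401, which leave &#39; untouched unless ENT_QUOTES is passed. Also, &nbsp; decodes to U+00A0, which trim() does not remove. The sketch below is one way to handle both, assuming UTF-8 input and PHP 7 or later. The function name htmlToText is hypothetical, and trim() is moved to the end (a change from the order above) because the non-breaking space only exists after decoding.

```php
<?php
// Hypothetical helper: strip tags, decode entities, then trim.
function htmlToText(string $html): string
{
    $text = strip_tags($html);
    // ENT_QUOTES is passed explicitly so that &#39; is decoded on PHP < 8.1 too.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0, which trim() ignores; map it to a plain space first.
    $text = str_replace("\u{00A0}", ' ', $text);
    return trim($text);
}

echo htmlToText("<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>");
// Hello World, this is StackOverflow's question details page
```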

Answer 4

The best and most reliable approach is probably a real (X|HT)ML parser, such as the DOMDocument class:

<?php

$str = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";

$dom = new DOMDocument;
$dom->loadXML(str_replace('&nbsp;', ' ', $str));

echo trim($dom->firstChild->nodeValue);
// "Hello World, this is StackOverflow's question details page"

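This may be overkill for this problem, but using a proper parser is a good habit to get into.

Edit: you can reuse the DOMDocument object, so only two lines are needed inside the loop: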

$dom = new DOMDocument;
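while ($rs = mysql_fetch_assoc($result)) { // or whichever fetch call your database API provides
    $dom->loadHTML(str_replace('&nbsp;', ' ', $rs['description']));
    // loadHTML() wraps the fragment in <html><body>, so firstChild is not the <p>;
    // read the text of the root element instead.
    $TMP_DESCR = trim($dom->documentElement->textContent);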

    // do something with $TMP_DESCR
}
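
The loop above can also be wrapped in a small function. This is only a sketch under stated assumptions: the DOM extension is enabled, each fragment is non-empty and ASCII-only apart from entities (loadHTML() does not assume UTF-8 for raw bytes), and PHP 7 or later is available. The name domText and the sample $rows array are hypothetical stand-ins for the database rows.

```php
<?php
// Hypothetical helper: extract the text of an HTML fragment with a reusable DOMDocument.
function domText(DOMDocument $dom, string $html): string
{
    // Collect parser warnings for sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // textContent concatenates every text node; entities are already decoded,
    // so &nbsp; arrives as U+00A0 and is mapped to a plain space before trim().
    $text = $dom->documentElement->textContent;
    return trim(str_replace("\u{00A0}", ' ', $text));
}

// Sample data standing in for $rs['description'] values.
$rows = [
    "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>",
    "<div><b>Bold</b> &amp; plain</div>",
];

$dom = new DOMDocument;
foreach ($rows as $html) {
    echo domText($dom, $html), "\n";
}
```

Because the parser decodes &nbsp; itself, the str_replace() here runs on the extracted text rather than on the markup.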

Extract text from HTML?
