Question

The service I use ultimately produces a string. It usually looks like this:

Hello &nbsp; Mr &nbsp; John Doe, you are now registered \t.
Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp; \n

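I need to strip all HTML entities, as well as every \t, \n and so on.

I can use html_entity_decode to get rid of the spaces and str_replace to remove \t or \n, but is there a more general approach? Something that lets you be sure the string contains only text characters (a string with no code in it).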

Answer 1

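If I understand your situation correctly, you basically want to convert HTML to plain text.

Depending on how complex the input is and how robust and accurate the result needs to be, you have a few options:

Use strip_tags() to remove the HTML tags, mb_convert_encoding() with HTML-ENTITIES as the source encoding to decode the entities, and strtr() or preg_replace() for any additional replacements: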

$html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
    Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
    Test: &euro;/&eacute;</p>";

$plain_text = $html;
$plain_text = strip_tags($plain_text);
$plain_text = mb_convert_encoding($plain_text, 'UTF-8', 'HTML-ENTITIES');
$plain_text = strtr($plain_text, [
    "\t" => ' ',
    "\r" => ' ',
    "\n" => ' ',
]);
$plain_text = preg_replace('/\s+/u', ' ', $plain_text);

var_dump($html, $plain_text);
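
As far as I know, passing HTML-ENTITIES to the mbstring functions is deprecated as of PHP 8.2, so the same pipeline can also be sketched with html_entity_decode(). This is only an illustration under that assumption: the helper name html_to_plain_text() is hypothetical, and the explicit \x{A0} in the pattern is there so the no-break space produced by &nbsp; is collapsed regardless of how \s treats it.

```php
<?php
// Hypothetical helper (not from the original answer): HTML fragment -> plain text.
function html_to_plain_text(string $html): string
{
    // 1. Drop the markup.
    $text = strip_tags($html);
    // 2. Decode named and numeric entities into UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // 3. Collapse runs of whitespace, including U+00A0 (from &nbsp;), into one space.
    $text = preg_replace('/[\s\x{A0}]+/u', ' ', $text);
    // 4. Remove the leading and trailing space left over from step 3.
    return trim($text);
}

$html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.\n\tTest: &euro;/&eacute;</p>";
var_dump(html_to_plain_text($html));
```

Unlike the snippet above, this version also trims the result, which is a design choice rather than a requirement.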

Use a proper DOM parser, possibly followed by preg_replace() for further tweaking:

$html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
    Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
    Test: &euro;/&eacute;</p>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

$plain_text = '';
foreach ($xpath->query('//text()') as $textNode) {
    $plain_text .= $textNode->nodeValue;
}
$plain_text = preg_replace('/\s+/u', ' ', $plain_text);

var_dump($html, $plain_text);

Both solutions should print something like this:

string(169) "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
    Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
    Test: &euro;/&eacute;</p>"
string(107) "Hello Mr John Doe, you are now registered. Hello Mr John Doe, your phone number is 555-555-555 Test: €/é"
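
The question also asks for a way to be sure that only text characters remain. One possible sketch, which is not part of the original answer, is to run the decoded plain text through a Unicode whitelist. The helper name keep_printable_text() and the choice of allowed character classes (letters, numbers, punctuation, symbols, space separators) are assumptions and should be adjusted to what counts as "text" for you.

```php
<?php
// Hypothetical helper: keep only printable text characters in an
// already-decoded UTF-8 string; everything else becomes a space.
function keep_printable_text(string $text): string
{
    // Replace anything that is not a letter, number, punctuation mark,
    // symbol or space separator (e.g. \t, \n, other control characters).
    $text = preg_replace('/[^\p{L}\p{N}\p{P}\p{S}\p{Zs}]+/u', ' ', $text);
    // Collapse runs of space separators (including U+00A0) into one space.
    $text = preg_replace('/\p{Zs}+/u', ' ', $text);
    return trim($text);
}

var_dump(keep_printable_text("Hello\t\tMr\n John Doe, you are now registered \t."));
```

This runs after the HTML has been stripped and the entities decoded; applied to raw HTML it would leave the tags and entity names in place, since those consist of ordinary letters and punctuation.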
