The service I'm using ends up generating a string. The strings usually look like this:
Hello Mr John Doe, you are now registered \t.
Hello &nbsb; Mr John Doe, your phone number is &nbsb; 555-555-555 &nbs; \n
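I need to remove all HTML entities as well as every \t, \n, and so on.
I could use html_entity_decode to deal with the non-breaking spaces and str_replace to remove the \t or \n, but is there a more general approach? Something that guarantees the string contains only alphabetic characters (a string with no code in it).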
Answer 0 (score: 2)
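If I understand your case correctly, you basically want to convert from HTML to plain text.
Depending on the complexity of your input and the robustness and accuracy you need, you have a couple of options:
Use strip_tags() to remove HTML tags, mb_convert_encoding() with HTML-ENTITIES as the source encoding to decode entities, and either strtr() or preg_replace() for any additional replacements: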
$html = "<p>Hello Mr John Doe, you are now registered.
Hello Mr John Doe, your phone number is 555-555-555
Test: €/é</p>";
$plain_text = $html;
$plain_text = strip_tags($plain_text);
$plain_text = mb_convert_encoding($plain_text, 'UTF-8', 'HTML-ENTITIES');
$plain_text = strtr($plain_text, [
"\t" => ' ',
"\r" => ' ',
"\n" => ' ',
]);
$plain_text = preg_replace('/\s+/u', ' ', $plain_text);
var_dump($html, $plain_text);
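As a variation on the snippet above, here is a minimal sketch that decodes entities with html_entity_decode() instead, since the HTML-ENTITIES encoding of mbstring is deprecated as of PHP 8.2. The function name html_to_plain_text and the sample string are made up for illustration, and the sketch assumes the input is valid UTF-8:

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes $html is valid UTF-8.
function html_to_plain_text(string $html): string
{
    // Drop tags first so only text and entities remain.
    $text = strip_tags($html);
    // Decode named and numeric entities into UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0; map it to an ordinary space explicitly.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse tabs, newlines and runs of spaces into single spaces.
    // preg_replace() returns null on failure, so fall back to the input.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

echo html_to_plain_text("<p>Hello &nbsp; Mr John Doe,\tyour phone number is &nbsp; 555-555-555 &nbsp;\n</p>");
// prints: Hello Mr John Doe, your phone number is 555-555-555
```

Replacing U+00A0 explicitly keeps the result independent of how \s treats non-breaking spaces in a given PCRE build.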
Use a proper DOM parser, plus possibly preg_replace() for further tweaking:
$html = "<p>Hello Mr John Doe, you are now registered.
Hello Mr John Doe, your phone number is 555-555-555
Test: €/é</p>";
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$plain_text = '';
foreach ($xpath->query('//text()') as $textNode) {
$plain_text .= $textNode->nodeValue;
}
$plain_text = preg_replace('/\s+/u', ' ', $plain_text);
var_dump($html, $plain_text);
Both solutions should print something like this:
string(169) "<p>Hello Mr John Doe, you are now registered.
Hello Mr John Doe, your phone number is 555-555-555
Test: €/é</p>"
string(107) "Hello Mr John Doe, you are now registered. Hello Mr John Doe, your phone number is 555-555-555 Test: €/é"
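If the DOM approach is to be reused, it could be wrapped roughly as follows. This is only a sketch under a few assumptions: the input is UTF-8, the mbstring extension is available, and text inside script and style elements should be ignored. The function name dom_to_plain_text is hypothetical. Non-ASCII characters are turned into numeric entities before parsing because DOMDocument::loadHTML() does not assume UTF-8 when the markup carries no charset declaration:

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes UTF-8 input and the mbstring extension.
function dom_to_plain_text(string $html): string
{
    // Encode every non-ASCII character as a numeric entity so the
    // parser does not have to guess the input encoding.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    // Silence warnings about imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $text = '';
    // Collect text nodes, skipping script and style content.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text .= $node->nodeValue;
    }

    // Treat non-breaking spaces as ordinary spaces, then collapse whitespace.
    $text = str_replace("\u{00A0}", ' ', $text);
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

var_dump(dom_to_plain_text("<p>Hello &nbsp; Mr John Doe</p><script>x()</script>"));
```

Compared with strip_tags(), the parser-based version copes better with malformed markup, at the cost of more moving parts.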