Generating plain text with PHP

Time: 2017-07-22 09:54:23

Tags: php string

I am using a service that ultimately produces a string. The string typically looks like this:

Hello   Mr   John Doe, you are now registered \t.
Hello &nbsb; Mr   John Doe, your phone number is &nbsb; 555-555-555 &nbs; \n

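I need to remove all HTML entities, as well as every \t, \n and so on.

I can use html_entity_decode to get rid of the spaces and str_replace to remove \t and \n, but is there a more general approach? Something that guarantees the string contains only alphabetic characters (that is, a string with no code in it).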

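1 answer:

Answer 0: (score: 2)

If I understand your situation correctly, you basically want to convert HTML to plain text.

Depending on the complexity of the input and on how robust and accurate the result needs to be, you have a few options:

  • Use strip_tags() to remove the HTML tags, mb_convert_encoding() with HTML-ENTITIES as the source encoding to decode the entities, and strtr() or preg_replace() for any additional replacements: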

    $html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
        Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
        Test: &euro;/&eacute;</p>";
    
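    $plain_text = $html;
    // Remove the markup, keeping only the text between tags.
    $plain_text = strip_tags($plain_text);
    // Decode entities such as &nbsp; and &euro; into UTF-8 characters.
    $plain_text = mb_convert_encoding($plain_text, 'UTF-8', 'HTML-ENTITIES');
    // Turn tabs and line breaks into ordinary spaces.
    $plain_text = strtr($plain_text, [
        "\t" => ' ',
        "\r" => ' ',
        "\n" => ' ',
    ]);
    // Collapse runs of whitespace into a single space.
    $plain_text = preg_replace('/\s+/u', ' ', $plain_text);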
    
    var_dump($html, $plain_text);
    
  • Use a proper DOM parser, possibly followed by preg_replace() for further tweaks:

    $html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
        Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
        Test: &euro;/&eacute;</p>";
    
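    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    
    // Concatenate every text node; the parser has already decoded the entities.
    $plain_text = '';
    foreach ($xpath->query('//text()') as $textNode) {
        $plain_text .= $textNode->nodeValue;
    }
    // Collapse runs of whitespace into a single space.
    $plain_text = preg_replace('/\s+/u', ' ', $plain_text);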
    
    var_dump($html, $plain_text);
    

Both solutions should print something like this:

string(169) "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
    Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
    Test: &euro;/&eacute;</p>"
string(107) "Hello Mr John Doe, you are now registered. Hello Mr John Doe, your phone number is 555-555-555 Test: €/é"
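
Note that using 'HTML-ENTITIES' with mb_convert_encoding() is deprecated as of PHP 8.2. The following is a minimal sketch of the first approach that uses html_entity_decode() (the function mentioned in the question) instead. The helper name html_to_plain_text() and the shortened sample input are hypothetical and not part of the original answer; the non-breaking space is listed explicitly in the pattern on the assumption that \s might not cover it in every environment.

```php
<?php
// Hypothetical helper (not from the original answer): HTML fragment -> plain text.
function html_to_plain_text(string $html): string
{
    // Strip tags first, so decoded entities such as &lt; are never treated as markup.
    $text = strip_tags($html);

    // Decode named and numeric entities into UTF-8 (&nbsp; becomes U+00A0).
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse tabs, line breaks, spaces and non-breaking spaces into one space.
    $collapsed = preg_replace('/[\s\x{A0}]+/u', ' ', $text);

    // preg_replace() returns null on failure (e.g. invalid UTF-8); fall back to $text.
    return trim($collapsed ?? $text);
}

// Shortened sample input, assumed for illustration only.
$html = "<p>Hello &nbsp; Mr &nbsp; John Doe,\tyou are now registered.\n"
      . "Test: &euro;/&eacute;</p>";

var_dump(html_to_plain_text($html));
```

Stripping the tags before decoding is a deliberate choice: if the entities were decoded first, escaped text such as &lt;b&gt; would turn into real tags and be removed along with the markup.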