Question

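Given some HTML, I apply the DOMDocument class (http://php.net/manual/en/class.domdocument.php) to it and save it, and occasionally an Â symbol is inserted. It seems to happen in tags that contain a single whitespace character (as opposed to &nbsp;), but not consistently (only the first <span> element shows it). Following the advice in "PHP DOMDocument->getElementByID adding Â in place of empty <span>", I tried adding encoding when displaying the resulting HTML, but the problem persists. What causes this, and how can I prevent it?

In case you are wondering why I am doing this: I have an application in which I replace HTML images with text. I run into this when HTML is copied from an Outlook email and pasted into a TinyMCE editor, and the HTML is then parsed.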

<?php
$message = <<<EOT
<p>Start</p>
<p> </p>
<p> </p>
<p></p>
<p class="MsoNormal">
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
<span style="font-size:10pt;font-family:Arial, 'sans-serif';color:#000080;">Phone: (444) 777-7777</span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
</p>
<p>End</p>
EOT;
    echo('<p>Initial HTML:</p> '.$message);
    $message_encoded = utf8_encode($message);
    $doc = new DOMDocument();
    $doc->loadHTML($message);
    $body = $doc->getElementsByTagName('body')->item(0);
    $message=$doc->saveHTML($body);
    echo('<p>Final HTML:</p> '.$message);
    echo('<p>Initial HTML encoded:</p> '.$message_encoded);
    $doc->loadHTML($message_encoded);
    $body = $doc->getElementsByTagName('body')->item(0);
    $message_encoded=$doc->saveHTML($body);
    echo('<p>Final HTML:</p> '.$message_encoded);
?>

Output:

<p>Initial HTML:</p> <p>Start</p>
<p> </p>
<p> </p>
<p></p>
<p class="MsoNormal">
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
<span style="font-size:10pt;font-family:Arial, 'sans-serif';color:#000080;">Phone: (444) 777-7777</span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
</p>
<p>End</p><p>Final HTML:</p> <body>
<p>Start</p>
<p>Â </p>
<p>Â </p>
<p></p>
<p class="MsoNormal">
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;">Â <br></span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br></span>
<span style="font-size:10pt;font-family:Arial, 'sans-serif';color:#000080;">Phone: (444)Â 777-7777</span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br></span>
</p>
<p>End</p>
</body><p>Initial HTML encoded:</p> <p>Start</p>
<p>Â </p>
<p>Â </p>
<p></p>
<p class="MsoNormal">
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;">Â <br /></span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
<span style="font-size:10pt;font-family:Arial, 'sans-serif';color:#000080;">Phone: (444)Â 777-7777</span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br /></span>
</p>
<p>End</p><p>Final HTML:</p> <body>
<p>Start</p>
<p>ÃÂ </p>
<p>ÃÂ </p>
<p></p>
<p class="MsoNormal">
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;">ÃÂ <br></span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br></span>
<span style="font-size:10pt;font-family:Arial, 'sans-serif';color:#000080;">Phone: (444)ÃÂ 777-7777</span>
<span style="font-size:10pt;font-family:Calibri, 'sans-serif';color:#000080;"> <br></span>
</p>
<p>End</p>
</body>

Answer 1

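The PHP DOM extension works in UTF-8. Similar byte-encoding issues apply to XML documents. Is your current encoding ISO-8859-1?

As http://php.net/manual/en/intro.dom.php advises:

The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Try modifying that part as follows: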

<p>End</p>
EOT;
    $message = utf8_encode($message); // this should fix it.
    echo('<p>Initial HTML:</p> '.$message);

Also set the script's output to UTF-8 and save your documents as UTF-8; this avoids many future encoding-related problems.
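As a rough sketch of what keeping everything in UTF-8 could look like with the code from the question: the helper below (`roundTripBodyUtf8` is a hypothetical name, not from the question) assumes `$html` is already valid UTF-8. It prepends a `<meta>` charset declaration, a commonly used workaround so that the HTML parser behind `loadHTML()` does not guess a single-byte encoding for a fragment that declares none.

```php
<?php
// Hypothetical helper; assumes $html is already valid UTF-8.
function roundTripBodyUtf8(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // The meta tag tells the HTML parser that the input bytes are UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same extraction step as in the question: serialize only <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $doc->saveHTML($body);
}

// "\u{00A0}" is a non-breaking space written as UTF-8.
$out = roundTripBodyUtf8("<p>Phone: (444)\u{00A0}777-7777</p>");
echo $out, "\n";
```

When the result is sent to a browser, the response would also need to declare the same charset, for example with `header('Content-Type: text/html; charset=UTF-8');` before any output.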

Hope that helps.

Answer 2

As DeDee said, your problem is caused by ISO-8859-1 characters being converted to UTF-8. Note that a space also counts as a character.
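To make the byte-level effect concrete, here is a small sketch. It assumes the whitespace in question is a non-breaking space stored as the single byte 0xA0 (its ISO-8859-1 form) and that the mbstring extension is available; the variable names are illustrative only.

```php
<?php
// A non-breaking space is the single byte 0xA0 in ISO-8859-1.
$latin1 = "a\xA0b";

// Converting to UTF-8 turns that byte into the two bytes 0xC2 0xA0.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Treating the UTF-8 bytes as ISO-8859-1 and converting a second time
// encodes 0xC2 and 0xA0 separately, which doubles the damage.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n";  // 61a062
echo bin2hex($utf8), "\n";    // 61c2a062
echo bin2hex($misread), "\n"; // 61c382c2a062
```

In ISO-8859-1 the byte 0xC2 is the letter Â, so a page displayed as ISO-8859-1 shows the UTF-8 form of the non-breaking space as "Â" followed by a space. The doubly converted form starts with 0xC3, which is Ã, and that is consistent with the "ÃÂ" seen in the question's last output.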

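There are three solutions:

1) Make sure the input is UTF-8.
2) Set the server's character set to ISO-8859-1.
3) Properly convert all characters to UTF-8.

I personally recommend 1 and advise against 2.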

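To accomplish 1)

Make sure you use a text editor (such as Notepad++) to create the file. Don't use an editor such as Microsoft Word. The rule of thumb is to make sure that whichever editor you use to write your software saves files in UTF-8.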
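When the input comes from users rather than from your own files, a check at runtime may help. The following is a minimal sketch, not part of the original answer: `ensureUtf8` is a hypothetical helper, and the ISO-8859-1 fallback is an assumption about where non-UTF-8 input comes from. It uses `mb_convert_encoding()` rather than `utf8_encode()`, which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: return $text unchanged if it is already valid UTF-8,
// otherwise convert it from an assumed fallback encoding.
function ensureUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

echo bin2hex(ensureUtf8("a\xA0b")), "\n";     // ISO-8859-1 input gets converted
echo bin2hex(ensureUtf8("a\xC2\xA0b")), "\n"; // UTF-8 input is left alone
```

Checking first avoids the double conversion that produces "ÃÂ": text that is already UTF-8 is never converted a second time.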

To accomplish 2)

In the top-level .htaccess file:

AddDefaultCharset iso-8859-1

In the <head> of the HTML file:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

To accomplish 3)

Here is a code snippet you can use:

//Convert character encoding to UTF-8
function replace_non_utf_characters($string) {
    /**
     * This array consists of $key=>$value pairs, where $key
     * is the character needing to be replaced, and $value is
     * the character $key is replaced by. Add characters to
     * this array as needed.
     */
    $replacement_array = array(
            chr(145) => "'", //the chr(#) are all Microsoft-encoded equivalents (e.g. open/close "smart" quotes)
            chr(146) => "'",
            chr(147) => "\"",
            chr(148) => "\"",
            chr(149) => "&#8226;",
            chr(150) => "&ndash;",
            chr(151) => "&mdash;",
            chr(153) => "&#8482;",
            chr(169) => "&copy;",
            chr(174) => "&reg;"
        );
    foreach($replacement_array as $key=>$replacement) {
        $string = str_replace($key, $replacement, $string);
    }
    //Force UTF-8 encoding, so that there will always be an output
    return mb_convert_encoding(str_replace(chr(194), '', mb_convert_encoding($string, "UTF-8", 'HTML-ENTITIES')), 'HTML-ENTITIES');
}

Answer 3

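"In case you are wondering why I am doing this: I have an application in which I replace HTML images with text. I run into this when HTML is copied from an Outlook email and pasted into a TinyMCE editor, and the HTML is then parsed."

Microsoft Word and Outlook add a lot of junk when content is cut and pasted into TinyMCE. Just add the TinyMCE "paste" plugin. You still need to deal with any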

Answer 4

Try adding

$message_encoded = mb_convert_encoding($message_encoded , 'HTML-ENTITIES', 'UTF-8');

after

$message_encoded = $doc->saveHTML($body);
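Note that the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2. A possible substitute, sketched here under the assumption that the string returned by `saveHTML()` is UTF-8, is `mb_encode_numericentity()`. The helper name `encodeNonAsciiAsEntities` is hypothetical.

```php
<?php
// Hypothetical helper: turn every non-ASCII character of a UTF-8 string
// into a decimal numeric entity, leaving ASCII markup untouched.
function encodeNonAsciiAsEntities(string $utf8Html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8Html, $map, 'UTF-8');
}

echo encodeNonAsciiAsEntities("<p>Phone: (444)\u{00A0}777-7777</p>"), "\n";
// prints <p>Phone: (444)&#160;777-7777</p>
```

Because the result is pure ASCII, it displays the same way whether the page is served as ISO-8859-1 or as UTF-8.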

Answer 5

This worked for me:

$htm = str_replace("&nbsp;"," ",$htm);
$doc->loadHTML($htm) ;

This was the only way I could get rid of the Â symbol.
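The replacement above only catches the literal `&nbsp;` entity. A slightly broader sketch follows, assuming the input is UTF-8 and that losing the non-breaking behavior is acceptable; `replaceNbspWithSpace` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: replace non-breaking spaces with ordinary spaces,
// whether they appear as entities or as the raw UTF-8 character.
function replaceNbspWithSpace(string $utf8Html): string
{
    return str_replace(['&nbsp;', '&#160;', "\u{00A0}"], ' ', $utf8Html);
}

$htm = replaceNbspWithSpace("<p>Phone: (444)&nbsp;777-7777</p>");
echo $htm, "\n"; // prints <p>Phone: (444) 777-7777</p>
```

The cleaned string can then be passed to `$doc->loadHTML($htm)` as in the answer.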

