Which character entities does DOMDocument output?

Asked: 2015-01-23 22:41:05

Tags: php utf-8 domdocument html-entities

PHP's DOMDocument class garbles UTF-8 input unless you prepare your input first.

For example, this code

<?php
echo mb_internal_encoding()."\n\n";

$str = '’';
$dom = new DOMDocument;
$dom->loadHTML($str);
echo $dom->saveHTML();

produces this output

UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&acirc;&#128;&#153;</p></body></html>

&acirc;&#128;&#153; should be &rsquo;
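The three entities in that output line up with the three UTF-8 bytes of ’ (U+2019: 0xE2 0x80 0x99) when each byte is read as a separate ISO-8859-1 character. The sketch below reproduces that mapping without DOMDocument; the function name bytesAsLatin1Entities is hypothetical, and treating the input as ISO-8859-1 is an assumption about what the parser does when no encoding is declared.

```php
<?php
// Hypothetical helper: render each byte of $bytes as if it were one ISO-8859-1 character.
function bytesAsLatin1Entities(string $bytes): array
{
    $out = [];
    foreach (str_split($bytes) as $byte) {
        // Named entity from the HTML 4.01 table, if one exists for this byte.
        $named = htmlentities($byte, ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');
        if ($named !== $byte) {
            $out[] = $named;
        } elseif (ord($byte) >= 0x80) {
            // No named entity: fall back to a decimal numeric reference.
            $out[] = '&#' . ord($byte) . ';';
        } else {
            // Plain ASCII is left unchanged.
            $out[] = $byte;
        }
    }
    return $out;
}

// The UTF-8 encoding of U+2019, written as explicit bytes.
echo implode('', bytesAsLatin1Entities("\xE2\x80\x99")), "\n";
```

This prints &acirc;&#128;&#153;, matching the garbled output above.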

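If you don't use the fix, I'd like to know every character entity that DOMDocument can produce, such as &acirc;. Is there a list somewhere? Is it in the PHP source code? The libxml source code?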

1 answer:

Answer 0: (score: 0)

I thought of a way to find out without reading any reference or source code:

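<?php

// Build a string containing every single-byte value, one per line.
$str = '';

for ($i = 1; $i < 256; $i++) {
    $str .= chr($i)."\n";
}

// Append the NUL byte last.
$str .= chr(0)."\n";

// Parse the bytes as HTML and print what DOMDocument serializes them to.
$dom = new DOMDocument;
$dom->loadHTML($str);
echo $dom->saveHTML();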

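If you need an accurate list, I suggest running this on your own system to get your own list, in case it differs between PHP versions and so on.

Expect a lot of warning messages, but no errors.

Here is the output I got, except that I used a text editor to remove everything that was not a character entity: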

&amp;
&#128;
&#129;
&#130;
&#131;
&#132;
&#133;
&#134;
&#135;
&#136;
&#137;
&#138;
&#139;
&#140;
&#141;
&#142;
&#143;
&#144;
&#145;
&#146;
&#147;
&#148;
&#149;
&#150;
&#151;
&#152;
&#153;
&#154;
&#155;
&#156;
&#157;
&#158;
&#159;
&nbsp;
&iexcl;
&cent;
&pound;
&curren;
&yen;
&brvbar;
&sect;
&uml;
&copy;
&ordf;
&laquo;
&not;
&shy;
&reg;
&macr;
&deg;
&plusmn;
&sup2;
&sup3;
&acute;
&micro;
&para;
&middot;
&cedil;
&sup1;
&ordm;
&raquo;
&frac14;
&frac12;
&frac34;
&iquest;
&Agrave;
&Aacute;
&Acirc;
&Atilde;
&Auml;
&Aring;
&AElig;
&Ccedil;
&Egrave;
&Eacute;
&Ecirc;
&Euml;
&Igrave;
&Iacute;
&Icirc;
&Iuml;
&ETH;
&Ntilde;
&Ograve;
&Oacute;
&Ocirc;
&Otilde;
&Ouml;
&times;
&Oslash;
&Ugrave;
&Uacute;
&Ucirc;
&Uuml;
&Yacute;
&THORN;
&szlig;
&agrave;
&aacute;
&acirc;
&atilde;
&auml;
&aring;
&aelig;
&ccedil;
&egrave;
&eacute;
&ecirc;
&euml;
&igrave;
&iacute;
&icirc;
&iuml;
&eth;
&ntilde;
&ograve;
&oacute;
&ocirc;
&otilde;
&ouml;
&divide;
&oslash;
&ugrave;
&uacute;
&ucirc;
&uuml;
&yacute;
&thorn;
&yuml;
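The manual clean-up in a text editor could also be scripted. This is only a sketch: the function name extractEntities and the regular expression are my own assumptions, not part of DOMDocument, and the pattern simply matches anything shaped like a named, decimal, or hexadecimal character reference.

```php
<?php
// Hypothetical helper: collect the distinct character references found in $html,
// in order of first appearance.
function extractEntities(string $html): array
{
    // Matches &name; as well as &#123; and &#x7B; style references.
    preg_match_all('/&(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/', $html, $matches);

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($matches[0]));
}

// Example with a short hand-written string; in practice pass $dom->saveHTML().
echo implode("\n", extractEntities('<p>&acirc;&#128;&#153; &amp; &acirc;</p>')), "\n";
```

Passing the result of $dom->saveHTML() from the script above should give a list like the one shown, one entity per line.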