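I have some Hebrew websites that contain character references such as: &#1504;&#1493;&#1507; (which renders as נוף)
I can only see these letters if I save the file as .html and view it with UTF-8 encoding.
If I try to open it as a regular text file, UTF-8 encoding does not show the proper output.
I noticed that if I open a text editor and write Hebrew in UTF-8, each character takes two bytes, not 4 bytes as in this example (&#1493;, which renders as ו).
Any idea whether this is UTF-16 or some other kind of UTF representation of letters?
How can I convert it to normal letters, if possible?
Using the latest PHP version.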
Answer 0 (score: 6)
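Those are character references: they refer to characters in ISO 10646 by specifying the code point of that character in decimal (&#n;) or hexadecimal (&#xn;) notation. They are not an encoding such as UTF-8 or UTF-16.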
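To make the relation between a reference and its bytes concrete, here is a minimal sketch. The helper name codepoint_to_utf8 is made up for this illustration, and the mbstring extension is assumed to be loaded. It shows that the decimal and hexadecimal forms name the same code point, and that the resulting Hebrew letter is two bytes long in UTF-8.

```php
<?php
// Hypothetical helper: turn a Unicode code point into a UTF-8 string.
function codepoint_to_utf8(int $codepoint): string {
    // pack('N', ...) yields the 4-byte big-endian form, which is UTF-32BE.
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}

$decimal = codepoint_to_utf8(1504);   // the code point named by &#1504;
$hex     = codepoint_to_utf8(0x5E0);  // the code point named by &#x5E0;

var_dump($decimal === $hex);   // bool(true): both name the letter nun
echo bin2hex($decimal), "\n";  // d7a0
echo strlen($decimal), "\n";   // 2
```

So the reference is just ASCII text that names a code point. Once decoded, the letter takes the usual two bytes in UTF-8.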
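You can use html_entity_decode, which decodes such character references as well as the entity references for the entities defined for HTML 4, so other references like &lt;, &gt;, &amp; will also get decoded: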
$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
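As a quick check of that call, the following sketch feeds it the Hebrew references from the question plus a few named entities. The sample string is made up for illustration.

```php
<?php
// Sample input: three numeric references plus some named entities.
$html = '&#1504;&#1493;&#1507; &lt;b&gt; &amp; &quot;';

// ENT_NOQUOTES leaves quote entities such as &quot; untouched.
$text = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');

echo $text, "\n";          // נוף <b> & &quot;
echo strlen($text), "\n";  // 20 (each Hebrew letter is 2 bytes)
```

Note that the markup-significant characters are decoded too, which may not be wanted if the result is going back into an HTML page.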
If you only want to decode the numeric character references, you can use this:
function html_dereference($match) {
    if (strtolower($match[1][0]) === 'x') {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}
$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);
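On newer PHP versions the same idea can be sketched with mb_chr instead of the pack and mb_convert_encoding pair. This is a variant, not the function above: it assumes PHP 7.2 or later with mbstring, and the name decode_numeric_refs is made up for this example.

```php
<?php
// Hypothetical variant: decode only numeric character references.
function decode_numeric_refs(string $str): string {
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function (array $m): string {
            // "x..." means hexadecimal, anything else is decimal.
            $codepoint = strtolower($m[1][0]) === 'x'
                ? intval(substr($m[1], 1), 16)
                : intval($m[1], 10);
            // mb_chr returns false for invalid code points (e.g. surrogates);
            // in that case the reference is left as it was.
            $char = mb_chr($codepoint, 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $str
    );
}

// Numeric references are decoded, the named entity &lt; is left alone.
echo decode_numeric_refs('&#1504;&#x5D5;&#1507; &lt;'), "\n";  // נוף &lt;
```

Leaving invalid references untouched is a design choice of this sketch. Substituting U+FFFD would be an equally reasonable option.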
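As YuriKolovsky and thirtydot have pointed out in another question, browser vendors seem to have "silently" agreed on a character reference mapping that differs from the specification and is largely undocumented.
Some character references that would normally map onto the Latin 1 supplement are actually mapped onto different characters. This is because the mapping was built from the characters of Windows-1252 rather than from those of ISO 8859-1, on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.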
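The difference between the two interpretations can be seen with a single byte. The sketch below assumes the mbstring extension, and the choice of value 147 (0x93) is only an example. It converts the same byte once as ISO 8859-1 and once as Windows-1252.

```php
<?php
// 0x93 is the number used by a reference such as &#147;.
$byte = "\x93";

// Read as ISO 8859-1, the byte is U+0093, an invisible C1 control character.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Read as Windows-1252, it is U+201C, the left double quotation mark.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n";  // c293
echo bin2hex($asCp1252), "\n";  // e2809c
```

Browsers follow the second reading for references in the range 128 to 159, which is why a decoder that sticks to the code point literally produces control characters where a quotation mark was intended.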
Here is an extension of the function above that handles this quirk:
function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
    $deref = function($match) use ($encoding, $fixMappingBug) {
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    };
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
}
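If anonymous functions are not available (they were introduced in PHP 5.3.0), you can also use create_function (which was deprecated in PHP 7.2 and removed in PHP 8.0):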
$deref = create_function('$match', '
    $encoding = '.var_export($encoding, true).';
    $fixMappingBug = '.var_export($fixMappingBug, true).';
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
        $mapping = array(
            8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
            338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
            8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
        $codepoint = $mapping[$codepoint-130];
    }
    return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
');
Here is another function that tries to comply with the behavior of HTML 5:
function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
    $deref = function($match) use ($flags, $charset) {
        if ($match[1][0] === '#') {
            // numeric reference: "#x..." is hexadecimal, "#..." is decimal
            if (strtolower($match[1][1]) === 'x') {
                $codepoint = intval(substr($match[1], 2), 16);
            } else {
                $codepoint = intval(substr($match[1], 1), 10);
            }
            // HTML 5 specific behavior
            // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references
            // handle Windows-1252 mismapping
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
            $overrides = array(
                0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
            if (isset($overrides[$codepoint])) {
                $codepoint = $overrides[$codepoint];
            }
            // surrogates and values beyond the Unicode range become U+FFFD
            if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                $codepoint = 0xFFFD;
            }
            // control characters and noncharacters are replaced as well
            if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                in_array($codepoint, array(
                    0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                $codepoint = 0xFFFD;
            }
            return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
        } else {
            // named entity reference: let PHP decode it
            return html_entity_decode($match[0], $flags, $charset);
        }
    };
    return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
}
I have also noticed that in PHP 5.4.0 the html_entity_decode function got another flag called ENT_HTML5 for HTML 5 behavior.
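One visible effect of that flag is the set of named entities that gets recognized. The sketch below assumes PHP 5.4 or later and uses &apos; only as a convenient example, since it is not an HTML 4.01 entity. It decodes the same input under both document types.

```php
<?php
$input = '&apos;';

// Under the HTML 4.01 rules, &apos; is not a known entity and is left alone.
$html4 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Under the HTML 5 rules, &apos; is decoded to a single quote.
$html5 = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($html4);  // string(6) "&apos;"
var_dump($html5);  // string(1) "'"
```

Whether ENT_HTML5 also reproduces the Windows-1252 overrides for numeric references is not something this example shows, so that case is worth testing separately.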
Answer 1 (score: 4)
Those are XML character references. You want to use html_entity_decode() to decode them:
$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
For more information, you can search Google for the entity in question. See these few examples: