如何将Unicode字符串转换为HTML实体? (HEX
不是十进制)
例如,将Français
转换为Français
。
答案 0 :(得分:8)
您的字符串看起来像UCS-4
编码,您可以尝试
$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);
输出
string 'Français' (length=13)
答案 1 :(得分:8)
对于related question中缺少的十六进制编码:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
这与使用UTF-32BE然后unpack
和vsprintf
的@ Baba答案类似,以满足格式化需求。
如果您希望iconv
超过mb_convert_encoding
,则类似:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = iconv('UTF-8', 'UTF-32BE', $utf8);
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
我发现这个字符串操作比Get hexcode of html entities更清晰。
答案 2 :(得分:4)
首先,当我最近遇到这个问题时,我通过确保我的代码文件,数据库连接和数据库表都是UTF-8来解决它然后,简单地回显文本的工作原理。如果必须使用htmlspecialchars()
而不是htmlentities()
来转义数据库的输出,那么UTF-8符号将保持不变,并且不会尝试转义。
想要记录替代解决方案,因为它为我解决了类似的问题。
我正在使用PHP的utf8_encode()
来逃避特殊的'字符。
我想将它们转换为HTML实体进行显示,我编写这段代码是因为我想尽可能避免使用iconv或类似的功能,因为并非所有环境都必须拥有它们(如果不是这样的话,请纠正我!)< / p>
$foo = 'This is my test string \u03b50';
echo unicode2html($foo);
function unicode2html($string) {
return preg_replace('/\\\\u([0-9a-z]{4})/', '&#x$1;', $string);
}
希望这可以帮助有需要的人: - )
答案 3 :(得分:0)
有关允许您执行以下操作的代码,请参阅How to get the character from unicode code point in PHP?:
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC string\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tchüß'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tchüß'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tchüß"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tchüß"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
答案 4 :(得分:0)
您也可以使用 PHP 4.0.6+ (link to PHP doc) 支持的 mb_encode_numericentity
。
function unicode2html($value) {
return mb_encode_numericentity($value, [
// start codepoint
// | end codepoint
// | | offset
// | | | mask
0x0000, 0x001F, 0x0000, 0xFFFF,
0x0021, 0x002C, 0x0000, 0xFFFF,
0x002E, 0x002F, 0x0000, 0xFFFF,
0x003C, 0x003C, 0x0000, 0xFFFF,
0x003E, 0x003E, 0x0000, 0xFFFF,
0x0060, 0x0060, 0x0000, 0xFFFF,
0x0080, 0xFFFF, 0x0000, 0xFFFF
], 'UTF-8', true);
}
通过这种方式,还可以指示哪些字符范围要转换为十六进制实体,哪些要保留为字符。
用法示例:
$input = array(
'"Meno più, PIÙ o meno"',
'\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
'<script>alert("XSS");</script>',
'"`'
);
$output = array();
foreach ($input as $str)
$output[] = unicode2html($str)
结果:
$output = array(
'"Meno più, PIÙ o meno"',
''ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà'',
'<script>alert("XSS");</script>',
'"`'
);