如何从PHP中的unicode代码点获取字符?

时间:2009-09-02 02:29:59

标签: php unicode character-encoding

例如,

如何获得与 U + 010F 对应的字符?

6 个答案:

答案 0 :(得分:20)

header('Content-Encoding: UTF-8');

function mb_html_entity_decode($string)
{
    if (extension_loaded('mbstring') === true)
    {
        mb_language('Neutral');
        mb_internal_encoding('UTF-8');
        mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));

        return mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
    }

    return html_entity_decode($string, ENT_COMPAT, 'UTF-8');
}

function mb_ord($string)
{
    if (extension_loaded('mbstring') === true)
    {
        mb_language('Neutral');
        mb_internal_encoding('UTF-8');
        mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));

        $result = unpack('N', mb_convert_encoding($string, 'UCS-4BE', 'UTF-8'));

        if (is_array($result) === true)
        {
            return $result[1];
        }
    }

    return ord($string);
}

function mb_chr($string)
{
    return mb_html_entity_decode('&#' . intval($string) . ';');
}

var_dump(hexdec('010F'));

var_dump(mb_ord('ó')); // 243
var_dump(mb_chr(243)); // ó

答案 1 :(得分:4)

我刚为polyfillord缺少多字节版本写了chr,并注意以下几点:

  • 只有在函数mb_ordmb_chr不存在的情况下才定义它们。如果它们确实存在于您的框架或PHP的未来版本中,则将忽略polyfill。

  • 它使用广泛使用的mbstring扩展程序进行转换。如果未加载mbstring扩展程序,则会使用iconv扩展名。


编辑:

我将HTML实体编码/解码和编码/解码的功能添加到JSON格式以及一些如何使用这些功能的演示代码


代码

if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        return ($from_encoding === NULL) ? iconv_get_encoding() : iconv_set_encoding($encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

如何使用

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC string\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tchüß'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tchüß'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

输出

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tchüß"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tchüß"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"

答案 2 :(得分:2)

IntlChar是一个基于ICU的新内置类,使用PHP / 7发布,可以解决这个问题:

  

IntlChar提供对许多实用程序方法的访问,这些方法可用于访问有关Unicode字符的信息。

// PHP 7.0 and later
var_dump(
  "\u{010F}" === IntlChar::chr(0x010F),
  0x010F === IntlChar::ord("\u{010F}")
);

// PHP 7.2.0-dev
var_dump(
  "\u{010F}" === mb_chr(0x010F, "UTF-8"),
  0x010F === mb_ord("\u{010F}", "UTF-8")
);

答案 3 :(得分:2)

如果这对任何人都有用,PHP 7.2现在已将mb_ordmb_chr等效项添加到ordchr。例如,以下代码将在PHP 7.2中使用

$charPoint = "U+010F";
echo mb_chr(hexdec(substr($charPoint,2)); // prints ď

这有两个副作用(a)一个不需要实现它们自己和(b)如果一个已经实现了它们自己需要将它包装在if (function_exists('mb_chr'))

答案 4 :(得分:0)

如果您控制字符串的UTF-8编码(根据拉丁语和其他欧洲标准的建议),您只需要

html_entity_decode($string, ENT_COMPAT, 'UTF-8');

参见php man的示例#1。如果您的字符串是标记语言(!),您可以将第二个参数更改为ENT_NOQUOTES等,支付注意,使用ENT_XHTML等。

答案 5 :(得分:0)

<?php

function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}

echo chr_utf8(hexdec('010F'));

// Output the UTF-8 character corresponding to U+010F