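Looking for a Unicode-compatible alternative to PHP's ord() function

Time: 2012-07-03 04:41:59

Tags: php unicode

After quite a bit of searching and testing, the simplest Unicode-compatible alternative to PHP's ord() function that I have found is: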

$utf8Character = 'Ą';
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
echo $ord; # 260

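I found it here. However, it has been mentioned that this method is rather slow. Does anyone know a more efficient method that is almost as simple? And what does UCS-4BE mean?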

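3 answers:

Answer 0: (score: 3)

You could also use iconv() for this, but the mb_convert_encoding approach you already have looks reasonable to me. Just make sure $utf8Character is a single character rather than a long string, and it will perform quite well.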
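As a rough illustration of the iconv() route mentioned above, here is a minimal sketch. The helper name iconv_ord is hypothetical (it is not part of PHP or of this answer), and the sketch assumes the installed iconv library recognises the encoding name UCS-4BE.

```php
<?php
// Hypothetical helper name, used only for this illustration.
function iconv_ord($utf8Character)
{
    // iconv() takes (from, to, string), the reverse of mb_convert_encoding()'s order.
    $ucs4 = iconv('UTF-8', 'UCS-4BE', $utf8Character);
    if ($ucs4 === false || strlen($ucs4) < 4) {
        return false; // conversion failed or the input was empty
    }
    // Only the first four bytes are read, i.e. the first character.
    list(, $ord) = unpack('N', substr($ucs4, 0, 4));
    return $ord;
}

echo iconv_ord('Ą'), "\n"; // 260
```

Whether this is any faster than mb_convert_encoding probably depends on the PHP build, so it is worth benchmarking before switching.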

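UCS-4BE is a Unicode encoding that stores each character as a 32-bit (4-byte) integer. That accounts for the "UCS-4"; the "BE" suffix means the integer is stored in big-endian order. The reason for using this encoding is that, unlike variable-width encodings such as UTF-8 or UTF-16, it needs no multi-byte sequences or surrogate pairs: every character has a fixed size.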
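To make the byte layout concrete, the following sketch prints the same character in UTF-8 and in UCS-4BE. It assumes the mbstring extension is loaded and that the source file itself is saved as UTF-8.

```php
<?php
$utf8Character = 'Ą'; // U+0104

// UTF-8 is variable-width: this character takes two bytes.
echo bin2hex($utf8Character), "\n"; // c484

// UCS-4BE is fixed-width: four bytes, most significant byte first.
$ucs4 = mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8');
echo bin2hex($ucs4), "\n"; // 00000104

// 'N' reads an unsigned 32-bit big-endian integer, which here is the code point.
list(, $ord) = unpack('N', $ucs4);
echo $ord, "\n"; // 260
```

Because the four bytes are simply the code point written as a big-endian integer, a single unpack('N', ...) call is enough to recover it.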

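Answer 1: (score: 2)

I just wrote a polyfill for the missing multibyte versions of ord and chr, with the following in mind:

  • It defines the functions mb_ord and mb_chr only if they do not already exist. If they do exist in your framework or in a future version of PHP, the polyfill is ignored.

  • It uses the widely available mbstring extension for the conversion. If the mbstring extension is not loaded, it falls back to the iconv extension.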
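As a small illustration of the first point: mb_ord() and mb_chr() were later added to the mbstring extension itself (in PHP 7.2), so on such installations a function_exists guard skips the polyfill and the native functions are used. The sketch below only checks which case applies; it assumes a UTF-8 source file.

```php
<?php
// On PHP 7.2+ with mbstring loaded, both functions are provided natively.
if (function_exists('mb_ord') && function_exists('mb_chr')) {
    $ord  = mb_ord('Ą', 'UTF-8');
    $char = mb_chr(260, 'UTF-8');
    echo $ord, "\n";           // 260
    echo bin2hex($char), "\n"; // c484
} else {
    echo "mb_ord()/mb_chr() are not defined here; a polyfill would be needed.\n";
}
```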

I also added functions for HTML entity encoding/decoding and for JSON-style encoding/decoding, along with some demo code showing how to use them.


Code

if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

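if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        // Test $encoding (the original tested the undefined $from_encoding).
        // Both iconv functions need the setting name; note that the iconv
        // 'internal_encoding' setting is deprecated as of PHP 5.6.
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}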

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

How to use

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"

Answer 2: (score: 0)

Here is my string-to-int conversion using that formula. You could also explode the string and use array_reduce to sum it up.

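/**
 * Sums the Unicode code points of every character in a UTF-8 string.
 *
 * @param string $string
 * @param int $index
 * @param int $carryResult
 * @return int
 */
function convertEncoding($string, $index = 0, $carryResult = 0)
{
    // Stop once $index has moved past the last character.
    if ($index >= mb_strlen($string, 'UTF-8')) {
        return $carryResult;
    }
    // mb_substr() extracts a whole character; $string[$index] would return a single byte.
    $currentCharacter = mb_substr($string, $index, 1, 'UTF-8');
    list(, $ord) = unpack('N', mb_convert_encoding($currentCharacter, 'UCS-4BE', 'UTF-8'));
    // This is a plain function, so recurse directly rather than through $this.
    return convertEncoding($string, $index + 1, $carryResult + $ord);
}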
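The array_reduce variant mentioned in this answer could look roughly like the sketch below. The function name sumCodePoints is hypothetical, and splitting with preg_split and the /u modifier is one assumed way to "explode" a UTF-8 string into characters; the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper name, used only for this illustration.
function sumCodePoints($string)
{
    // Split into characters; the /u modifier makes PCRE treat the input as UTF-8.
    $characters = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($characters === false) {
        return false; // the input was not valid UTF-8
    }

    // Add up the code point of each character, starting from 0.
    return array_reduce($characters, function ($carry, $character) {
        list(, $ord) = unpack('N', mb_convert_encoding($character, 'UCS-4BE', 'UTF-8'));
        return $carry + $ord;
    }, 0);
}

echo sumCodePoints('Ąb'), "\n"; // 260 + 98 = 358
```

Compared with the recursive version, this avoids deep call stacks on long strings.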