Question

Suppose I have this string in PHP:

$str = '🀄️';

Or this string in JavaScript:

var str = '🀄️';

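If I run utf8_encode($str), the result is \ud83c\udc04\ufe0f, but I want it to be 1F004 or 1f004 or \u1f004 so that I can look up the image file that matches the character.

I have searched online a lot for a way to encode this, and I keep finding the same terms used for very different things. It looks like what I want is to "encode" a string to UTF-32 code points, but I really don't know the right name for it. I just want to convert this 🀄️ into this 1f004 using PHP and/or JavaScript.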

http://www.fileformat.info/info/unicode/char/1f004/index.htm

Thanks.

Answer 1

A JavaScript function:

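function e2u(str){
    // strip the variation selector (U+FE0F) and the zero-width joiner (U+200D)
    str = str.replace(/\ufe0f|\u200d/g, '');
    var i = 0, c = 0, p = 0, r = [];
    while (i < str.length){
        c = str.charCodeAt(i++);
        if (p){
            // p is a pending high surrogate and c its low surrogate: combine them
            r.push((0x10000 + ((p - 0xD800) << 10) + (c - 0xDC00)).toString(16));
            p = 0;
        } else if (0xD800 <= c && c <= 0xDBFF){
            // high surrogate: hold it until the next code unit arrives
            p = c;
        } else {
            // code unit in the Basic Multilingual Plane: use it directly
            r.push(c.toString(16));
        }
    }
    return r.join('-');
}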
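
For comparison, here is a minimal sketch of the same idea for engines that support ES2015, where for...of iterates a string by code point and codePointAt does the surrogate arithmetic. The name emojiToCodePoints is hypothetical and not part of the answer above; unlike e2u, this version passes a lone surrogate through as its own hex value.

```javascript
// Hypothetical helper (assumes ES2015 support); not the original answer's code.
function emojiToCodePoints(str) {
    // U+FE0F (variation selector-16) and U+200D (zero-width joiner) are dropped,
    // matching what e2u strips with its regular expression.
    const skipped = new Set([0xfe0f, 0x200d]);
    const parts = [];
    // for...of walks the string by code point, so a surrogate pair arrives as one item.
    for (const ch of str) {
        const cp = ch.codePointAt(0);
        if (!skipped.has(cp)) {
            parts.push(cp.toString(16));
        }
    }
    return parts.join('-');
}

console.log(emojiToCodePoints('\u{1F004}\uFE0F')); // 1f004
```

The output is lowercase hex joined with '-', so a sequence of several emoji code points maps to a name such as 1f468-1f469.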

Answer 2

You want to get Unicode code points out of a byte stream, so utf8_encode will not help. I found an implementation here.

// Note: the curly-brace offset syntax $c{0} was removed in PHP 8, so $c[0] is used.
function utf8_to_unicode($c)
{
    $ord0 = ord($c[0]); if ($ord0>=0   && $ord0<=127) return $ord0;
    $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
    $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
    $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
    return false;
}

var_dump( dechex(utf8_to_unicode('🀄')) ); // string(5) "1f004"

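UTF-8 is compatible with single-byte ASCII, so the first check is straightforward: a lead byte between 0 and 127 is returned unchanged. Code points above 127 are represented by multi-byte sequences. The next 1,920 characters need two bytes, which the second check handles with ($ord0-192)*64 + ($ord1-128). The lead byte must be between 192 (11000000) and 223 (11011111) to match the two-byte pattern (strictly, 192 and 193 can only produce overlong encodings, so valid sequences start at 194). The second byte must have the form 10xxxxxx (128 to 191 in decimal). The first code point represented this way is U+0080 and the last is U+07FF.

And so on for the three-byte and four-byte cases.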
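
Since the question also asks about JavaScript, the same lead-byte ranges can be sketched there with bit masks instead of subtraction and multiplication. The name utf8BytesToCodePoint is hypothetical, TextEncoder is assumed to be available (modern browsers and recent Node.js), and, like the PHP version, the sketch does not validate continuation bytes or reject overlong sequences.

```javascript
// Hypothetical helper mirroring the lead-byte ranges of utf8_to_unicode.
// Takes an array-like of UTF-8 bytes and decodes the first code point only.
function utf8BytesToCodePoint(bytes) {
    const b0 = bytes[0];
    if (b0 <= 0x7f) {
        return b0; // 0xxxxxxx: one byte (ASCII)
    }
    if (b0 >= 0xc0 && b0 <= 0xdf) {
        // 110xxxxx 10xxxxxx: two bytes
        return ((b0 & 0x1f) << 6) | (bytes[1] & 0x3f);
    }
    if (b0 >= 0xe0 && b0 <= 0xef) {
        // 1110xxxx 10xxxxxx 10xxxxxx: three bytes
        return ((b0 & 0x0f) << 12) | ((bytes[1] & 0x3f) << 6) | (bytes[2] & 0x3f);
    }
    if (b0 >= 0xf0 && b0 <= 0xf7) {
        // 11110xxx followed by three continuation bytes: four bytes
        return ((b0 & 0x07) << 18) | ((bytes[1] & 0x3f) << 12) |
               ((bytes[2] & 0x3f) << 6) | (bytes[3] & 0x3f);
    }
    return false; // not a valid lead byte (or empty input)
}

const bytes = new TextEncoder().encode('\u{1F004}'); // bytes f0 9f 80 84
console.log(utf8BytesToCodePoint(bytes).toString(16)); // 1f004
```

Masking with 0x3f is equivalent to subtracting 128 from a well-formed continuation byte, and shifting left by 6, 12 or 18 bits corresponds to multiplying by 64, 4096 or 262144 in the PHP code.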

Encoding emoji as Unicode code points - PHP / JS