I am trying to understand how Perl handles Unicode.
use feature qw(say);
use strict;
use warnings;
use Encode qw(encode);
say unpack "H*", pack("U", 0xff);
say unpack "H*", encode( 'UTF-8', chr 0xff );
Output:
ff
c3bf
Why do I get ff rather than c3bf when using pack?
Answer 0 (score: 2)
Why do I get ff rather than c3bf when using pack?
This is because pack creates a character string, not a byte string.
> perl -MDevel::Peek -e 'Dump(pack("U", 0xff));'
SV = PV(0x13a6d18) at 0x13d2ce8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0xa6d298 "\303\277"\0 [UTF8 "\x{ff}"]
CUR = 2
LEN = 32
So unpack("H*") does not look at the byte values of that string, but at its (truncated) character values. If you do this instead:
say unpack "H*", encode("UTF-8", pack("U", 0xff));
then you get the expected result.
See also this thread.
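The difference between the character string and its encoded bytes can be checked directly. The following is a minimal sketch, assuming a reasonably modern Perl (5.10 or later); the variable names $chars and $bytes are only illustrative:

```perl
use strict;
use warnings;
use feature qw(say);
use Encode qw(encode);

# pack("U", ...) yields a character string: one character, code point 0xFF.
my $chars = pack("U", 0xff);

# encode() turns it into a byte string holding the UTF-8 encoding.
my $bytes = encode("UTF-8", $chars);

say length $chars;         # 1 (one character)
say length $bytes;         # 2 (two bytes)
say unpack "H*", $chars;   # ff
say unpack "H*", $bytes;   # c3bf
```

The internal buffer of $chars happens to hold C3 BF (as the Devel::Peek dump shows), but length and unpack operate on the logical characters, not on that internal representation.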
Answer 1 (score: 2)
pack('U', 0xFF) is just a weird way of writing chr(0xFF), so:
"\xFF" returns chars FF
chr(0xFF) returns chars FF
pack('U', 0xFF) returns chars FF
"\xC3\xBF" returns chars C3 BF
encode('UTF-8', chr(0xFF)) returns chars C3 BF
encode('UTF-8', pack('U', 0xFF)) returns chars C3 BF
and therefore:
say unpack "H*", "\xFF"; outputs ff
say unpack "H*", chr(0xFF); outputs ff
say unpack "H*", pack('U', 0xFF); outputs ff
say unpack "H*", "\xC3\xBF"; outputs c3bf
say unpack "H*", encode('UTF-8', pack('U', 0xFF)); outputs c3bf
say unpack "H*", encode('UTF-8', chr(0xFF)); outputs c3bf
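The equivalences above can also be verified by listing the code point of each character in a string. This is a rough sketch; char_codes is a hypothetical helper introduced here for illustration, not something from the Encode module:

```perl
use strict;
use warnings;
use feature qw(say);
use Encode qw(encode);

# Hypothetical helper: hex code point of every character in a string.
sub char_codes {
    my ($s) = @_;
    return join " ", map { sprintf("%02X", ord($_)) } split //, $s;
}

say char_codes("\xFF");                        # FF
say char_codes(chr(0xFF));                     # FF
say char_codes(pack('U', 0xFF));               # FF
say char_codes(encode('UTF-8', chr(0xFF)));    # C3 BF

# String equality compares characters, not the internal storage format.
say "equal" if pack('U', 0xFF) eq chr(0xFF);   # equal
```

All three single-character strings compare equal with eq, even though pack('U', ...) stores its result internally as UTF-8 and the other two may not.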