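I am trying to understand UTF-8 in Perl.
I have the following string: Alizéh. If I look up the hex of this string, https://onlineutf8tools.com/convert-utf8-to-hexadecimal gives me 416c697ac3a968 (which matches the original source of the string).
So I thought that packing the hex and encoding it as utf8 should produce the Unicode string. But it produces something very different.
Can anyone explain what I am doing wrong?
Here is a simple test program that shows what I am doing.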
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
Any tips on how to take the hex value of a UTF-8 string and turn it into a valid UTF-8 scalar in Perl?
In this extended version I illustrate some further oddities:
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
print "=========================================== Unaccent test start\n";
my $plaintest = unac_string('utf8', "Alizéh");
print "Alizéh passed to the unaccent gives $plaintest\n";
my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as $cleanpackedHexIntoPlainString\n";
my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Unaccenting the packed version gives $packedtest\n";
utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";
$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Now unaccenting the packed version gives $packedtest\n";
print "=========================================== Unaccent test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish
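In this test, the unaccent library appears to accept the packed version of the hex string. I am not sure why. Can someone help me understand why that works?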
Answer 0 (score: 5)
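Unicode strings are first-class values in Perl; you do not need to jump through these hoops. You only need to recognize and keep track of when you have bytes and when you have characters. Perl will not make that distinction for you, and every byte string is also a valid character string. In fact you are double-encoding your strings: the result is still a valid string, but it now holds the UTF-8 encoded bytes that represent (the characters corresponding to) the original UTF-8 encoded bytes.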
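To make the double-encoding concrete, here is a minimal sketch that inspects the bytes at each step. The variable names `$double` and `$chars` are only illustrative; the hex value is the one from the question.

```perl
use strict;
use warnings;

# The packed string already holds UTF-8 encoded bytes (c3 a9 is "é").
my $bytes = pack 'H*', '416c697ac3a968';

# utf8::encode treats each byte as a character and encodes it again,
# so c3 becomes c3 83 and a9 becomes c2 a9.
my $double = $bytes;
utf8::encode($double);
print unpack('H*', $double), "\n";   # 416c697ac383c2a968

# utf8::decode goes the other way: c3 a9 becomes the single
# character U+00E9, which unpack 'H*' reports as e9.
my $chars = $bytes;
utf8::decode($chars);
print unpack('H*', $chars), "\n";    # 416c697ae968
```

The second value is the same hex the question reports for the literal written under `use utf8`, which suggests that literal was already a decoded character string.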
use utf8;
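This pragma decodes your source code from UTF-8, so the string literals that follow it are already Unicode strings and can be passed to any API that correctly accepts characters. To get the same thing from a string of UTF-8 bytes (which is what you produce by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).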
use strict;
use warnings;
use utf8;
use Encode 'decode';
my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
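As a quick check on the snippet above, the following sketch compares the byte string with the decoded character string and then round-trips back to hex. It assumes nothing beyond the core Encode module; the name `$hex_again` is illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $hex   = '416c697ac3a968';
my $bytes = pack 'H*', $hex;            # UTF-8 encoded bytes
my $chars = decode('UTF-8', $bytes);    # decoded characters

# "é" takes two bytes in UTF-8 but is a single character once decoded.
printf "bytes: %d, chars: %d\n", length $bytes, length $chars;   # bytes: 7, chars: 6

# The fifth character is the code point U+00E9.
printf "U+%04X\n", ord substr($chars, 4, 1);                      # U+00E9

# Encoding the characters again yields the original hex.
my $hex_again = unpack 'H*', encode('UTF-8', $chars);
print "$hex_again\n";                                              # 416c697ac3a968
```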
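A Unicode string must be encoded to UTF-8 before it is output to something that expects bytes, such as STDOUT. An :encoding(UTF-8) layer can be applied to such a handle to do this automatically, and likewise to an input handle to decode automatically. Exactly what you should use depends entirely on where your characters come from and where they are going. See this answer for too much information on the available options.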
use Encode 'encode';
print encode 'UTF-8', "$chars\n";
binmode *STDOUT, ':encoding(UTF-8)'; # warning: global effect
print "$chars\n";
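If the global effect of binmode on STDOUT is a concern, the layer can be tried out on a private handle instead. The sketch below writes to an in-memory filehandle (the `$buffer` and `$fh` names are illustrative) and shows that the layer produces the same bytes as an explicit encode call.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = decode('UTF-8', pack('H*', '416c697ac3a968'));

# Open an in-memory handle with an encoding layer; only this handle is affected.
open my $fh, '>:encoding(UTF-8)', \my $buffer
    or die "cannot open in-memory handle: $!";
print {$fh} $chars;
close $fh;

# The layer encoded the characters on the way out.
print unpack('H*', $buffer), "\n";    # 416c697ac3a968

# Same result as encoding explicitly before printing.
print $buffer eq encode('UTF-8', $chars) ? "same bytes\n" : "different bytes\n";
```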