使用JSON库处理时会破坏UTF-8字符(可能这类似于Problem with decoding unicode JSON in perl,但设置binmode只会产生另一个问题)。
我已将问题缩小到以下示例:
(hlovdal) localhost:/tmp/my_test>cat my_test.pl
#!/usr/bin/perl -w
use strict;
use warnings;
use JSON;
use File::Slurp;
use Getopt::Long;
use Encode;
my $set_binmode = 0;
GetOptions("set-binmode" => \$set_binmode);
if ($set_binmode) {
binmode(STDIN, ":encoding(UTF-8)");
binmode(STDOUT, ":encoding(UTF-8)");
binmode(STDERR, ":encoding(UTF-8)");
}
sub check {
my $text = shift;
return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}
my $my_test = "hei på deg";
my $json_text = read_file('my_test.json');
my $hash_ref = JSON->new->utf8->decode($json_text);
print check($my_test), "\$my_test = $my_test\n";
print check($json_text), "\$json_text = $json_text";
print check($$hash_ref{'my_test'}), "\$\$hash_ref{'my_test'} = " . $$hash_ref{'my_test'} . "\n";
(hlovdal) localhost:/tmp/my_test>
在运行测试时,文本由于某种原因被压缩为iso-8859-1。设置binmode排序可以解决它,但会导致其他字符串的双重编码。
(hlovdal) localhost:/tmp/my_test>cat my_test.json
{ "my_test" : "hei på deg" }
(hlovdal) localhost:/tmp/my_test>file my_test.json
my_test.json: UTF-8 Unicode text
(hlovdal) localhost:/tmp/my_test>hexdump -c my_test.json
0000000 { " m y _ t e s t " : " h
0000010 e i p 303 245 d e g " } \n
000001e
(hlovdal) localhost:/tmp/my_test>
(hlovdal) localhost:/tmp/my_test>perl my_test.pl
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
(hlovdal) localhost:/tmp/my_test>
造成这种情况的原因是什么以及如何解决?
这是在新安装的最新Fedora 15系统上。
(hlovdal) localhost:/tmp/my_test>perl --version | grep version
This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux-thread-multi
(hlovdal) localhost:/tmp/my_test>rpm -q perl-JSON
perl-JSON-2.51-1.fc15.noarch
(hlovdal) localhost:/tmp/my_test>locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
(hlovdal) localhost:/tmp/my_test>
更新:添加use utf8
无法解决问题,但仍未正确处理字符(虽然与之前略有不同):
(hlovdal) localhost:/tmp/my_test>perl my_test.pl
is_utf8(): 1, is_utf8(1): 1. $my_test = hei p� deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
is_utf8(): 1, is_utf8(1): 1. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
(hlovdal) localhost:/tmp/my_test>
所述
我可以在Perl源中使用Unicode吗?
是的,你可以!如果您的来源是 UTF-8编码,你可以指出 使用utf8 pragma。
use utf8;
这对你没有任何帮助 输入,或输出。它只是 影响你的来源 读。您可以在字符串中使用Unicode 文字,标识符(但它们 仍然必须是“单词字符” 根据\ w),甚至是习俗 分隔符。
答案 0 :(得分:9)
您以UTF-8保存了程序,但忘了告诉Perl。添加use utf8;
。
另外,你编程太复杂了。 JSON函数DWYM。要检查内容,请使用Devel :: Peek。
use utf8; # for the following line
my $my_test = 'hei på deg';
use Devel::Peek qw(Dump);
use File::Slurp (read_file);
use JSON qw(decode_json);
my $hash_ref = decode_json(read_file('my_test.json'));
Dump $hash_ref; # Perl character strings
Dump $my_test; # Perl character string
答案 1 :(得分:0)
这只是我的印象,还是这个perl库希望你将UTF-8字节代码写入isoLatin1字符串(字符串上禁用utf-8标志); 同样,它会在iso拉丁字符串中返回UTF-8字节代码:
#! /usr/bin/perl -w
use strict;
use Encode;
use Data::Dumper qw(Dumper);
use JSON; # imports encode_json, decode_json, to_json and from_json.
use utf8;
###############
## EXAMPLE 1:
################
my $json = JSON->new->allow_nonref;
my $exampleAJsonObj = { key1 => 'a'};
my $exampleAText = $json->utf8->encode( $exampleAJsonObj );
my $exampleAJsonObfUtf = { key1 => 'ä'};
my $exampleATextUtf = $json->utf8->encode( $exampleAJsonObfUtf);
#binmode(STDOUT, ":utf8");
print "EXAMPLE1: ";
print "\n";
print encode 'UTF-8', "exampleAText: $exampleAText and as object: " . Dumper($exampleAJsonObj);
print "\n";
print encode 'UTF-8', "exampleATextUtf: $exampleATextUtf and as object: " . Dumper($exampleAJsonObfUtf) . " Key1 was: " . $exampleAJsonObfUtf->{key1};
print "\n";
print hexdump($exampleAText);
print "\n";
print hexdump($exampleATextUtf);
print "\n";
#############################
## SUB.
#############################
# For a given string parameter, returns a string which shows
# whether the utf8 flag is enabled and a byte-by-byte view
# of the internal representation.
#
sub hexdump
{
my $str = shift;
my $flag = Encode::is_utf8($str) ? 1 : 0;
use bytes; # this tells unpack to deal with raw bytes
my @internal_rep_bytes = unpack('C*', $str);
return
$flag
. '('
. join(' ', map { sprintf("%02x", $_) } @internal_rep_bytes)
. ')';
}
最后,输出是:
exampleAText: {"key1":"a"} and as object: $VAR1 = {
'key1' => 'a'
};
exampleATextUtf: {"key1":"ä"} and as object: $VAR1 = {
'key1' => "\x{e4}"
};
Key1 was: ä
0(7b 22 6b 65 79 31 22 3a 22 61 22 7d)
0(7b 22 6b 65 79 31 22 3a 22 c3 a4 22 7d)
因此,我们看到在此过程结束时,outpu字符串都不是UTF-8字符串,这是错误的。至少,0(7b 22 6b 65 79 31 22 3a 22 c3 a4 22 7d)。 请注意,c3 A4是ä的正确字节代码 http://www.utf8-chartable.de/
因此,库似乎期望一个人将utf-8字符串编码为非utf-8字符串,因此,它将执行相同的操作,它将输出一个带有utf的NON utf-8字符串8字节代码。
我错了吗?
进一步的实验让我得出以下结论: perlObjects返回并消耗的字符串标记为UTF-8(正如我预期的那样)。 消费并从decode / encode返回的perl字符串必须看起来像perl作为ISO latin 1字符串但具有utf8字节代码。 因此,在打开包含UTF8 json的文件时,请勿使用“&lt ;: encoding(UTF-8)”。
答案 2 :(得分:-1)
问题的核心是JSON期望八位字节数组而不是字符串(在this question中求解)。但是我也错过了几个与unicode相关的东西,比如“use utf8”。下面是使示例中的代码完全运行所需的差异:
--- my_test.pl.orig 2011-08-03 15:44:44.217868886 +0200
+++ my_test.pl 2011-08-03 15:55:30.152379269 +0200
@@ -1,19 +1,14 @@
-#!/usr/bin/perl -w
+#!/usr/bin/perl -CSAD
use strict;
use warnings;
use JSON;
use File::Slurp;
use Getopt::Long;
use Encode;
-
-my $set_binmode = 0;
-GetOptions("set-binmode" => \$set_binmode);
-
-if ($set_binmode) {
- binmode(STDIN, ":encoding(UTF-8)");
- binmode(STDOUT, ":encoding(UTF-8)");
- binmode(STDERR, ":encoding(UTF-8)");
-}
+use utf8;
+use warnings qw< FATAL utf8 >;
+use open qw( :encoding(UTF-8) :std );
+use feature qw< unicode_strings >;
sub check {
my $text = shift;
@@ -21,8 +16,9 @@
}
my $my_test = "hei på deg";
-my $json_text = read_file('my_test.json');
-my $hash_ref = JSON->new->utf8->decode($json_text);
+my $json_text = read_file('my_test.json', binmode => ':encoding(UTF-8)');
+my $json_bytes = encode('UTF-8', $json_text);
+my $hash_ref = JSON->new->utf8->decode($json_bytes);
print check($my_test), "\$my_test = $my_test\n";
print check($json_text), "\$json_text = $json_text";