How can I tell whether a string is encoded as UTF-8 or cp1252?

Time: 2017-12-12 05:00:41

Tags: perl encoding utf-8 cp1252

Is there a way in Perl to determine which of the two encodings, UTF-8 or cp1252, a string is in?

2 answers:

Answer 0 (score: 1)

The core module Encode::Guess should be up to the job

use Encode::Guess;

my $enc = guess_encoding($data, qw(cp1252));  # utf8 among defaults

and then

ref($enc) or die "Can't guess: $enc"; # trap error this way
my $utf8 = $enc->decode($data);

(from the docs).

To use the defaults of "ascii, utf8 and UTF-16/32 with BOM" as well, first set the suspects list

Encode::Guess->set_suspects(qw(utf8 cp1252));

and then get the encoding

my $enc = guess_encoding($data);

Or, copying from the docs

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

See the documentation for details.

There are many subtleties here; see the comment by tripleee and, for example, this post

Answer 1 (score: 1)

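# True if $string is well-formed UTF-8 (utf8::decode works in place, hence the copy)
my $could_be_utf8 = utf8::decode( my $tmp = $string );

# True if $string contains none of the five byte values that cp1252 leaves undefined
my $could_be_cp1252 = $string !~ /[\x81\x8D\x8F\x90\x9D]/;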
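
The two checks can both be true for the same string, so a caller still has to pick an order. As a minimal sketch, the hypothetical helper below (classify_bytes is not part of the answer) assumes that the input is a raw byte string and that data which decodes as well-formed UTF-8 is more likely to be UTF-8 than cp1252. That assumption is a heuristic, not a guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the answer: classify a raw byte string.
# Returns 'ascii', 'utf-8', 'cp1252' or 'unknown'.
sub classify_bytes {
    my ($bytes) = @_;

    # Pure 7-bit data is byte-for-byte identical in both encodings.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # utf8::decode modifies its argument in place, so work on a copy.
    # It returns false when the bytes are not well-formed UTF-8.
    my $copy = $bytes;
    return 'utf-8' if utf8::decode($copy);

    # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in cp1252.
    return 'cp1252' if $bytes !~ /[\x81\x8D\x8F\x90\x9D]/;

    return 'unknown';
}

print classify_bytes("caf\xC3\xA9"), "\n";    # input: UTF-8 bytes for "café"
print classify_bytes("caf\xE9"),     "\n";    # input: cp1252 byte for "é"
```

Checking UTF-8 first is a design choice: almost any byte sequence passes the cp1252 test, whereas arbitrary cp1252 text with high bytes rarely happens to be well-formed UTF-8.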

If you need to handle strings that contain a mix of the two, see Fixing a file consisting of both UTF-8 and Windows-1252