Encode::Guess->guess gives a different result from guess_encoding

Time: 2014-12-19 13:36:58

Tags: perl character-encoding

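I have this script. somefile.xsd is a file that contains a few UTF-8 characters. However, I cannot get guess_encoding to report the same encoding as Encode::Guess->guess. Ignoring the fact that the file is an XSD, what obvious thing (and I am sure it is probably obvious) am I missing?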

use Encode;
use Encode::Guess;

open (FILE, "<", "somefile.xsd");

print ("Reading file...\n");
#$text = <FILE>;
while ($text = <FILE>) {
    $encoding1 = Encode::Guess->guess($text);
    if (ref($encoding1)) {
        $name = $encoding1->name;
        print "$name : $text" if ($name ne "ascii");
    } else {
        print ("Not found : $text");
    }

    $encoding2 = guess_encoding($text, qw/iso-8859-15 ascii iso-8859-1 utf8/);
    if (ref($encoding2)) {
        $name = $encoding2->name;
        print "$name : $text" if ($name ne "ascii");
    } else {
        print ("Not found : $text");
    }

}

close(FILE);

When I run it, it gives the following result:

H:\play>perl encoding.pl
Reading file...
utf8 :                  <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
Not found :                     <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
utf8 :                  <xs:enumeration value="Occitan (post 1500); Provençal"/>
Not found :                     <xs:enumeration value="Occitan (post 1500); Provençal"/>
utf8 :                  <xs:enumeration value="Volapük"/>
Not found :                     <xs:enumeration value="Volapük"/>

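Edit for clarification: I want to use the guess_encoding version with its second option (i.e. the list of suspects). Removing the list amounts to calling Encode::Guess->guess. The use case is that I want to check whether a file matches one of a set of encodings, and passing in the valid list seems more elegant than calling guess and looking the name up in a list, especially when I have {{ 1}} giving me a $encoding->name result, which means I cannot simply check the list for equality.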

1 answer:

Answer 0: (score: 0)

Try removing the qw list:

$encoding2 = guess_encoding($text);

This should give you the correct answer.
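A minimal sketch of why the call with the list reports no match, assuming the sample line is stored as UTF-8 bytes (the byte string below is made up for illustration). On failure both guess and guess_encoding return a plain error string rather than an encoding object, so printing that string instead of a fixed "Not found" shows the reason. The Encode::Guess documentation warns that ISO-8859 suspects decode almost any input successfully, so with iso-8859-1, iso-8859-15 and utf8 all in the list, more than one suspect survives and no single encoding can be chosen.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# UTF-8 encoded bytes for a sample line (the letter a-ring is \xC3\xA5).
# This string is an illustrative stand-in for a line of somefile.xsd.
my $bytes = qq{<xs:enumeration value="Bokm\xC3\xA5l"/>\n};

# Default suspects only (ascii, utf8, and BOM-marked UTF-16/32).
my $default = guess_encoding($bytes);

# Explicit suspects, including two single-byte ISO-8859 encodings.
my $listed = guess_encoding($bytes, qw/iso-8859-15 ascii iso-8859-1 utf8/);

for my $result ($default, $listed) {
    if (ref $result) {
        # Success: an encoding object was returned.
        print "Guessed: ", $result->name, "\n";
    } else {
        # Failure: the return value is an error message string.
        print "No single match: $result\n";
    }
}
```

The first call should report utf8, while the second should print the error message that the original script hides behind "Not found".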

EDIT.

Running this code:

use Encode;
use Encode::Guess;

open (FILE, "<", "somefile.xsd");

print ("Reading file...\n");
#$text = <FILE>;
while ($text = <FILE>) {
    $encoding1 = Encode::Guess->guess($text);
    if (ref($encoding1)) {
        $name = $encoding1->name;
        print "$name : $text" if ($name ne "ascii");
    } else {
        print ("Not found : $text");
    }

    $encoding2 = guess_encoding($text, qw/iso-8859-15 ascii iso-8859-1 utf8/);

    if (ref($encoding2)) {
        $name = $encoding2->name;
        print "$name : $text" if ($name ne "ascii");
    } else {
        print ("Not found : $text");
    }

    $encoding3 = guess_encoding($text);

    if (ref($encoding3)) {
        $name = $encoding3->name;
        print "$name : $text" if ($name ne "ascii");
    } else {
        print ("Not found : $text");
    }

    print "-"x40 ."\n";
}

close(FILE);

produces

Reading file...
utf8 : <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
Not found : <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
utf8 : <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
----------------------------------------
utf8 : <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
Not found : <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
utf8 : <xs:enumeration value="Bokmål, Norwegian; Norwegian Bokmål"/>
----------------------------------------
utf8 : <xs:enumeration value="Occitan (post 1500); Provençal"/>
Not found : <xs:enumeration value="Occitan (post 1500); Provençal"/>
utf8 : <xs:enumeration value="Occitan (post 1500); Provençal"/>
----------------------------------------
utf8 : <xs:enumeration value="Occitan (post 1500); Provençal"/>
Not found : <xs:enumeration value="Occitan (post 1500); Provençal"/>
utf8 : <xs:enumeration value="Occitan (post 1500); Provençal"/>
----------------------------------------
utf8 : <xs:enumeration value="Volapük"/>
Not found : <xs:enumeration value="Volapük"/>
utf8 : <xs:enumeration value="Volapük"/>
----------------------------------------
utf8 : <xs:enumeration value="Volapük"/>
Not found : <xs:enumeration value="Volapük"/>
utf8 : <xs:enumeration value="Volapük"/>
----------------------------------------
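For the use case in the question (checking that the data matches one of a set of accepted encodings), one possible sketch is to guess with the default suspects and then compare canonical names. The helper matches_allowed is hypothetical, not part of Encode::Guess. It resolves each accepted name through find_encoding so that aliases such as latin1 compare equal to the canonical iso-8859-1. Because it keeps the single-byte encodings out of the suspects list, it can only positively identify ascii, utf8 and BOM-marked UTF-16/32 input.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Hypothetical helper: true if the guessed encoding of $bytes is in @allowed.
sub matches_allowed {
    my ($bytes, @allowed) = @_;

    # Map each alias to its canonical name, e.g. latin1 -> iso-8859-1.
    my %canonical;
    for my $alias (@allowed) {
        my $enc = find_encoding($alias) or die "Unknown encoding: $alias";
        $canonical{ $enc->name } = 1;
    }

    my $guess = guess_encoding($bytes);    # default suspects only
    return 0 unless ref $guess;            # error string: no usable guess
    return exists $canonical{ $guess->name } ? 1 : 0;
}

# Illustrative UTF-8 bytes (u-umlaut is \xC3\xBC).
my $utf8_line = qq{<xs:enumeration value="Volap\xC3\xBCk"/>\n};

for my $set ([qw/utf8 latin1/], [qw/latin1/]) {
    my $verdict = matches_allowed($utf8_line, @$set) ? "allowed" : "not allowed";
    print "@$set: $verdict\n";
}
```

Note that Encode treats utf8 and UTF-8 as distinct encodings with the canonical names utf8 and utf-8-strict, so the accepted list should name utf8 to match what the guesser returns here.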