Question

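I use a Perl script to download CSV files from another server. After downloading, I want to check whether a file contains any corrupted data. I tried using Encode::Detect::Detector to detect the encoding, but it returns undef in both of these cases:

If the string is ASCII, or
If the string is corrupted

So with the program below I cannot distinguish between ASCII and corrupted data.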

 use strict;
 use Text::CSV;
 use Encode::Detect::Detector;
 use XML::Simple;
 use Encode;
 require Encode::Detect;

 my @rows;
 my $init_file = "new-data-jp-2013-8-8.csv";



 my $csv = Text::CSV->new ( { binary => 1 } )
                 or die "Cannot use CSV: ".Text::CSV->error_diag ();

 open my $fh, $init_file or die $init_file.": $!";

 while ( my $row = $csv->getline( $fh ) ) {
     my @fields = @$row; # get line into array
     for (my $i=1; $i<=23; $i++){  # I already know that CSV file has 23 columns
            if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef){
                print "the encoding is undef in col".$i.
                            "  where field is ".$fields[$i-1].
                            " and its length is  ".length($fields[$i-1])." \n";
            }
            else {
            my $string = decode("Detect", $fields[$i-1]);
            print "this is string print  ".$string.
                    " the encoding is ".Encode::Detect::Detector::detect($fields[$i-1]).
                    " and its length is  ".length($fields[$i-1])."\n";
            }
        }   
     }

Answer 1

You have some mistaken assumptions about encodings, and there are some errors in your script.

foo() eq undef

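makes no sense. You can't test for string equality against undef, because undef isn't a string. It does, however, stringify to the empty string, so the comparison is true for undef and for an empty string alike. If you use warnings, Perl reports an uninitialized-value warning when you do something like this. To test whether a value is undef or not, use defined: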

unless(defined foo()) { .... }
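
To make the difference concrete, here is a minimal sketch. The helper names maybe_value and describe are hypothetical and only stand in for any function that may return undef; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns undef for "no result", otherwise a string.
sub maybe_value {
    my ($want) = @_;
    return $want ? "UTF-8" : undef;
}

# Classify a value without ever comparing against undef with eq.
sub describe {
    my ($value) = @_;
    return "undefined" unless defined $value;   # check definedness first
    return "empty"     if $value eq '';         # eq is safe from here on
    return "value: $value";
}

print describe(maybe_value(0)), "\n";
print describe(''), "\n";
print describe(maybe_value(1)), "\n";
```

With eq undef, the first two cases would be indistinguishable; checking defined first keeps them apart.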

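The Encode::Detect::Detector module has both a functional and an object-oriented interface. According to the documentation, detect is a plain function that takes the octets as its argument, so

Encode::Detect::Detector::detect($foo)

is the form to use. Do not call it as a class method:

Encode::Detect::Detector->detect($foo)

passes the class name as an extra first argument. The object-oriented interface is meant for feeding data to a detector object in chunks.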

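You probably shouldn't decode field by field. A document usually has a single encoding, so you specify that encoding when opening the file handle, e.g.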

use autodie;
open my $fh, "<:utf8", $init_file;
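
Since the goal is to detect corrupted data, one option is to decode strictly, so that malformed bytes raise an error instead of passing through silently. The following is a minimal sketch, assuming the file is supposed to be UTF-8 (the question does not say which encoding the server actually uses). The function name is_valid_utf8 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Returns 1 if $octets is well-formed UTF-8, 0 otherwise.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying the buffer it is given.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# Sample inputs (assumptions for the demo, not data from the question).
for my $sample ("plain ascii", "caf\xC3\xA9", "caf\xE9 au lait") {
    print is_valid_utf8($sample) ? "ok\n" : "not valid UTF-8\n";
}
```

A failed check only says the bytes are not valid UTF-8. They might be corrupted, or they might be intact text in another encoding such as Latin-1 or Shift_JIS.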

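While CSV can support some degree of binary data (such as encoded text), it isn't well suited to that purpose, and you may want to choose another data format.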

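Finally, ASCII data effectively needs no decoding or encoding, so the undef result from the encoding detection makes sense here. The detector cannot assert with certainty that a document was encoded as ASCII, because many encodings are supersets of ASCII. What it can assert for a given document is that it is not valid ASCII (i.e. some byte has the 8th bit set) and must therefore be in a more complex encoding such as Latin-1 or UTF-8.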
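
The 8th-bit test is easy to do by hand, which gives a way to tell the two undef cases from the question apart. This is a rough sketch: is_ascii and explain_detection are hypothetical names, and $charset stands for whatever the detector returned (possibly undef).

```perl
use strict;
use warnings;

# True if every byte has the 8th bit clear, i.e. the string is pure ASCII.
sub is_ascii {
    my ($octets) = @_;
    return $octets !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Interpret a detection result without comparing against undef with eq.
sub explain_detection {
    my ($octets, $charset) = @_;
    return "detected $charset" if defined $charset;
    return is_ascii($octets)
        ? "plain ASCII"
        : "non-ASCII bytes, encoding unknown";
}

print explain_detection("id,name,price", undef), "\n";
print explain_detection("\xFF\xFE broken",  undef), "\n";
```

The "encoding unknown" case covers both corrupted data and intact data in an encoding the detector could not identify with enough confidence, so it is a hint to inspect the field rather than proof of corruption.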

How can Perl detect corrupted data in a CSV file?
