Handling malformed characters in Perl

Asked: 2014-04-06 11:50:26

Tags: perl encoding

I'd like some advice about Perl.

I have text files that I want to process with Perl. They are encoded in cp932, but for some reason they may contain malformed characters.

My program looks like this:

#! /usr/bin/perl -w

use strict;
use encoding 'utf-8';

# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";

while ( my $line = <$in> ) {

  # my process comes here

  print $line;

}

If workfile.txt contains malformed characters, Perl complains:

cp932 "\x81" does not map to Unicode at ./my_program.pl line 8, <$in> line 1234.

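Perl knows whether its input contains malformed characters. So I want to rewrite the program to check whether each input line is good or bad and act accordingly: print every good line (one with no malformed characters) to output filehandle A, and print every line that contains malformed characters to output filehandle B.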

#! /usr/bin/perl -w

use strict;
use encoding 'utf-8';
use English;

# 'workfile.txt' is supposed to be encoded in cp932
open my $in, "<:encoding(cp932)", "./workfile.txt";

open my $output_good, ">:encoding(utf8)", "good.txt";
open my $output_bad,  ">:encoding(utf8)", "bad.txt";

select $output_good;   # in most cases workfile.txt lines are good

while ( my $line = <$in> ) {

  if ( $line contains malformed characters ) {

    select $output_bad;

  }

  print "$INPUT_LINE_NUMBER: $line";

  select $output_good;

}

My question is how to write the "if ( $line contains malformed characters )" part, that is, how to check whether the input is good or bad.

Thanks in advance.

1 Answer:

Answer 0 (score: 3)

#! /usr/bin/perl -w

use strict;

use utf8;                             # Source encoded using UTF-8
use open ':std', ':encoding(UTF-8)';  # STD* is UTF-8;
                                      #   UTF-8 is default encoding for open.
use Encode qw( decode );

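# Read raw bytes so that decoding happens explicitly, one line at a time.
open my $fh_in,   "<:raw", "workfile.txt"
   or die "Can't open workfile.txt: $!";
# good.txt receives decoded text, written as UTF-8 (the default set by "use open").
open my $fh_good, ">",     "good.txt"
   or die "Can't create good.txt: $!";
# bad.txt receives the original bytes unchanged.
open my $fh_bad,  ">:raw", "bad.txt"
   or die "Can't create bad.txt: $!";

while ( my $line = <$fh_in> ) {
   # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $line intact.
   my $decoded_line =
      eval { decode('cp932', $line, Encode::FB_CROAK|Encode::LEAVE_SRC) };
   if (defined($decoded_line)) {
      print($fh_good "$. $decoded_line");
   } else {
      print($fh_bad  "$. $line");
   }
}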
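
To try the check without touching any files, here is a minimal sketch that runs the same eval/decode test on two in-memory byte strings. The helper name decode_cp932_or_undef and the sample strings are made up for illustration; only decode and the Encode::FB_CROAK and Encode::LEAVE_SRC constants come from the Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of Encode): returns the decoded text,
# or undef when $bytes is not valid cp932.
sub decode_cp932_or_undef {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input, eval turns that into undef.
    # LEAVE_SRC keeps decode() from modifying $bytes.
    return eval {
        decode('cp932', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
}

# Sample byte strings, chosen only for this illustration.
my @samples = (
    "\x93\xfa\x96\x7b\x8c\xea\n",  # three valid two-byte cp932 characters
    "abc\x81\n",                   # lead byte 0x81 followed by an invalid trail byte
);

for my $bytes (@samples) {
    my $text = decode_cp932_or_undef($bytes);
    print defined $text ? "good\n" : "bad\n";
}
```

The first sample should be reported as good and the second as bad. The second sample reproduces the situation in the question's error message, where a lone "\x81" does not map to Unicode.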