How to replace invalid UTF-8 characters with spaces

Date: 2014-03-19 16:18:19

Tags: perl unix utf-8

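I have a fixed-width file that contains some non-UTF-8 characters, and I want to replace those characters with spaces.

I tried running iconv -f utf8 -t utf8 -c $file, but all it does is remove the non-UTF-8 characters. iconv has no way to replace them with spaces.

I would like a Korn shell script or a Perl script that replaces every non-UTF-8 character with a space.

I found this Perl script, which prints the lines that contain non-UTF-8 characters, but I don't know enough Perl to change it so that it replaces them with spaces.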

perl -l -ne '/
   ^( [\000-\177]                 # 1-byte pattern
     |[\300-\337][\200-\277]      # 2-byte pattern
     |[\340-\357][\200-\277]{2}   # 3-byte pattern
     |[\360-\367][\200-\277]{3}   # 4-byte pattern
     |[\370-\373][\200-\277]{4}   # 5-byte pattern
     |[\374-\375][\200-\277]{5}   # 6-byte pattern
    )*$ /x or print' FILE.dat

Environment: AIX

1 answer:

Answer 0 (score: 2)

Perl's Encode module can do this.

#!/usr/bin/perl

use strict;
use warnings;

use Encode qw(encode decode);

while (<>) {
   # decode the utf-8 bytes and make them into characters
   # and turn anything that's invalid into U+FFFD
   my $string = decode("utf-8", $_);

   # change any U+FFFD into spaces
   $string =~ s/\x{fffd}/ /g;

   # turn it back into utf-8 bytes and print it back out again
   print encode("utf-8", $string);
}

The same thing as a shorter command-line version:

perl -pe 'use Encode; $_ = Encode::decode("utf-8",$_); s/\x{fffd}/ /g; $_ = Encode::encode("utf-8", $_)'