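I have a fixed-width file that contains some non-UTF-8 characters, and I want to replace those characters with spaces.
I tried running iconv -f utf8 -t utf8 -c $file
but all that does is delete the non-UTF-8 characters. iconv has no option to replace them with spaces instead.
I would like a Korn shell script or a Perl script that replaces every non-UTF-8 character with a space.
I found the Perl script below, which prints the lines that contain non-UTF-8 characters, but I don't know enough Perl to adapt it to replace those characters with spaces.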
perl -l -ne '/
^( [\000-\177] # 1-byte pattern
|[\300-\337][\200-\277] # 2-byte pattern
|[\340-\357][\200-\277]{2} # 3-byte pattern
|[\360-\367][\200-\277]{3} # 4-byte pattern
|[\370-\373][\200-\277]{4} # 5-byte pattern
|[\374-\375][\200-\277]{5} # 6-byte pattern
)*$ /x or print' FILE.dat
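Environment: AIX.
Answer 0 (score: 2)
Perl's Encode module can do this: decode turns malformed input into U+FFFD, which you can then replace with a space.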
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);
while (<>) {
# decode the utf-8 bytes and make them into characters
# and turn anything that's invalid into U+FFFD
my $string = decode("utf-8", $_);
# change any U+FFFD into spaces
$string =~ s/\x{fffd}/ /g;
# turn it back into utf-8 bytes and print it back out again
print encode("utf-8", $string);
}
Or a shorter command-line version:
perl -pe 'use Encode; $_ = Encode::decode("utf-8",$_); s/\x{fffd}/ /g; $_ = Encode::encode("utf-8", $_)'
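Since the file is fixed-width, it may also matter that each record keeps its byte length. Below is a minimal sketch of a byte-level alternative that builds on the regex idea from the question: every byte that is not part of a well-formed UTF-8 sequence becomes exactly one space. The sub name blank_invalid_utf8 is made up for illustration, and the pattern assumes the stricter RFC 3629 definition (1 to 4 bytes, no overlong forms or surrogates) rather than the 5- and 6-byte forms the question's regex accepts. It also assumes field widths are counted in bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One well-formed UTF-8 sequence per RFC 3629 (assumption: stricter than
# the pattern in the question, which also allows 5- and 6-byte forms).
my $valid = qr/
    [\x00-\x7F]                        # 1-byte (ASCII)
  | [\xC2-\xDF][\x80-\xBF]             # 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
  | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
/x;

# Hypothetical helper: takes a byte string and returns a copy in which
# every byte outside a valid sequence is replaced by one space.
sub blank_invalid_utf8 {
    my ($bytes) = @_;
    # At each position, keep a valid sequence if one starts here;
    # otherwise consume a single byte and emit a space in its place.
    $bytes =~ s/($valid)|(.)/defined $1 ? $1 : ' '/gse;
    return $bytes;
}

# Filter the files named on the command line, writing to standard output.
if (@ARGV) {
    while (my $line = <>) {
        print blank_invalid_utf8($line);
    }
}
```

Saved as, say, fix_utf8.pl (a hypothetical name), it could be run as perl fix_utf8.pl FILE.dat > FILE.clean. Unlike the Encode version, valid text never passes through a decode/encode round trip, so a U+FFFD that is legitimately present in the input is left alone.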