How to batch-remove foreign-language lines from subtitles

Time: 2011-11-11 01:29:56

Tags: perl

I have some samples like this:

2
00:01:32,288 --> 00:01:33,208
¬O¥L­Ì¶Ü¡H
How are you?

3
00:01:36,768 --> 00:01:39,648
€Ñ°Ú¡A¥L­Ì¥ŽºâŽN³o»ò°µ¶Ü¡H
âŽN³o»ò°µ¶Ü¡H
I am fine
And you ?

-------------------- Here is my solution, but it is incomplete

#!/usr/bin/perl -w
$lineIndex = 0;
while($line=<>){
    $lineIndex++;   #line index start from 1
    $content{$lineIndex}=$line;  #copy to content
    for($i = 0; $i < length ($line); $i++){
        $char = substr $line,$i,1;
        if($char =~ /\W/){
            #print $char;
            $count{$lineIndex}++; #how many special char this line
        }
    }
}
# if line contains more than 14 special char,then skip
print "\n";
for $i (keys %count){
    if($count{$i} > 14){       #<----------------see here
        delete $content{$i};#delete from content
    }
}

for $j (sort { $a <=> $b } keys %content){ #output in numeric line order (a plain sort would put 10 before 2)
    print $content{$j};
}

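My solution has this problem: O J b յX is not matched, because its special-character count is <= 14. If I lower the threshold, strings such as 00:01:33,208 are matched as well and get deleted from the content.

Is there a way to inspect the characters as UTF-8?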

2 answers:

Answer 0 (score: 2)

Here is a simpler solution:

while($line = <>) {
    print $line unless $line =~ /[^\x00-\x7e]/;
}

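The character class [\x00-\x7e] covers the basic ASCII characters, control characters included (only DEL, 0x7f, falls outside it). Any line that contains a character outside that range is therefore not printed.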
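To see the filter at work without a subtitle file, here is a minimal sketch that runs the same test over a few in-memory lines. The helper name is_ascii_line and the sample bytes are assumptions made for illustration (the high bytes are arbitrary stand-ins for the non-ASCII subtitle text, not the real data from the question).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): true when the line
# contains no character outside the range \x00-\x7e.
sub is_ascii_line {
    my ($line) = @_;
    return $line !~ /[^\x00-\x7e]/;
}

# Sample block modeled on the question; the third entry uses arbitrary
# high bytes as a stand-in for the foreign-language line.
my @lines = (
    "2\n",
    "00:01:32,288 --> 00:01:33,208\n",
    "\xa7\x41\xa6\x6e\n",
    "How are you?\n",
);

# Keep only the lines that pass the ASCII test, then print them.
my @kept = grep { is_ascii_line($_) } @lines;
print @kept;
```

The index line, the timestamp line and the English line pass the test, and the line with high bytes is dropped. No threshold is involved, so a short foreign-language line is removed just like a long one.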

Answer 1 (score: 0)

#!/usr/bin/perl -w
$lineIndex = 0;
$flag = 1;
while($line=<>){
    $line = join('',$line);
    $lineIndex++;
    # always keep timestamp lines such as 00:01:32,288 --> 00:01:33,208
    if($line =~ /\d\d:\d\d:\d\d,\d\d\d/){
        print $line;
        next;
    }
    # clear the flag at the first character outside the whitelist
    for($i = 0; $i < length ($line); $i++){
        $char = substr $line,$i,1;
        if($char !~ /[\w ,.;!?'\-\r\n]/){
            $flag = 0;
            last;
        }else{
            $flag = 1;
        }
    }
    if($flag == 1){
        print $line;
    }
}
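
The character-by-character loop above can also be written as one anchored regex that requires the whole line to consist of whitelisted characters. The following is only a sketch of that idea, not the answerer's code: the helper name keep_line and the sample bytes are hypothetical, and \w is spelled out as A-Za-z0-9_ on the assumption that only ASCII word characters should count.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the subtitle line should be kept.
sub keep_line {
    my ($line) = @_;
    # Timestamp lines contain ':' and '>', which are not whitelisted,
    # so they are passed through explicitly.
    return 1 if $line =~ /\d\d:\d\d:\d\d,\d\d\d/;
    # Keep the line only if every character is in the whitelist.
    return $line =~ /\A[A-Za-z0-9_ ,.;!?'\-\r\n]*\z/ ? 1 : 0;
}

# Sample block modeled on the question; the third entry uses arbitrary
# high bytes as a stand-in for the foreign-language line.
my @sample = (
    "3\n",
    "00:01:36,768 --> 00:01:39,648\n",
    "\xa4\xd1\xb0\xda\n",
    "I am fine\n",
    "And you ?\n",
);

my @kept = grep { keep_line($_) } @sample;
print @kept;
```

A whitelist like this also drops English lines that contain characters missing from the class, for example double quotes or parentheses, so the class may need extending for real subtitle files.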