Question

I have some samples like this:

2
00:01:32,288 --> 00:01:33,208
¬O¥LÌ¶Ü¡H
How are you?

3
00:01:36,768 --> 00:01:39,648
€Ñ°Ú¡A¥LÌ¥ŽºâŽN³o»ò°µ¶Ü¡H
âŽN³o»ò°µ¶Ü¡H
I am fine
And you ?

-------------------- Here is my solution, but it is incomplete:

#!/usr/bin/perl -w
$lineIndex = 0;
while($line=<>){
    $lineIndex++;   #line index start from 1
    $content{$lineIndex}=$line;  #copy to content
    for($i = 0; $i < length ($line); $i++){
        $char = substr $line,$i,1;
        if($char =~ /\W/){
            #print $char;
            $count{$lineIndex}++; #how many special char this line
        }
    }
}
# if line contains more than 14 special char,then skip
print "\n";
for $i (keys %count){
    if($count{$i} > 14){       #<----------------see here
        delete $content{$i};#delete from content
    }
}

for $j (sort keys %content){ #output
    print $content{$j};
}

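My solution has this problem: a line like O J b յX is not matched, because its count of special characters is <= 14. If I lower the threshold, strings such as 00:01:33,208 are matched as well and get deleted from the content.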

Is there a way to check the characters as UTF-8?

Answer 1

Here is a simpler solution:

while($line = <>) {
    print $line unless $line =~ /[^\x00-\x7e]/;
}

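The character class [\x00-\x7e] covers the basic ASCII characters, control characters included (only DEL, \x7f, falls outside it), so any line containing a character outside that range is skipped.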
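As a rough sketch of how that check behaves on data like the sample in the question, the snippet below wraps the same regex in a helper and runs it over a few lines. The name keep_ascii_lines and the byte values in the third sample line are assumptions made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): return only the
# lines that contain no character outside \x00-\x7e.
sub keep_ascii_lines {
    my @lines = @_;
    return grep { $_ !~ /[^\x00-\x7e]/ } @lines;
}

# Sample cue; the third line uses made-up high bytes as a stand-in
# for the mis-encoded text shown in the question.
my @sample = (
    "2\n",
    "00:01:32,288 --> 00:01:33,208\n",
    "\xACO\xA5L\xB6\xDC\xA1H\n",
    "How are you?\n",
);

# Prints the cue number, the timestamp line and "How are you?".
print keep_ascii_lines(@sample);
```

The same filter can be run straight from the shell as perl -ne 'print unless /[^\x00-\x7e]/' in.srt > out.srt, where in.srt and out.srt are placeholder file names. Note that this also drops lines with legitimate accented letters, since those fall outside the range too.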

Answer 2

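#!/usr/bin/perl -w
while($line=<>){
    # Always keep timestamp lines such as 00:01:32,288 --> 00:01:33,208
    if($line =~ /\d\d:\d\d:\d\d,\d\d\d/){
        print $line;
        next;
    }
    # Assume the line is clean, then clear the flag at the first
    # character outside the whitelist of word chars and punctuation
    $flag = 1;
    for($i = 0; $i < length ($line); $i++){
        $char = substr $line,$i,1;
        if($char !~ /[\w ,.;!?'\-\r\n]/){
            $flag = 0;
            last;
        }
    }
    # Print only lines made up entirely of whitelisted characters
    if($flag == 1){
        print $line;
    }
}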

How to batch-remove foreign-language text from subtitles
