I have some samples like this:
2
00:01:32,288 --> 00:01:33,208
¬O¥L̶ܡH
How are you?
3
00:01:36,768 --> 00:01:39,648
€Ñ°Ú¡A¥LÌ¥ŽºâŽN³o»ò°µ¶Ü¡H
âŽN³o»ò°µ¶Ü¡H
I am fine
And you ?
-------------------- Here is my solution, but it is incomplete:
#!/usr/bin/perl -w
$lineIndex = 0;
while($line=<>){
    $lineIndex++; #line index start from 1
    $content{$lineIndex}=$line; #copy to content
    for($i = 0; $i < length ($line); $i++){
        $char = substr $line,$i,1;
        if($char =~ /\W/){
            #print $char;
            $count{$lineIndex}++; #how many special char this line
        }
    }
}
# if line contains more than 14 special char,then skip
print "\n";
for $i (keys %count){
    if($count{$i} > 14){ #<----------------see here
        delete $content{$i};#delete from content
    }
}
for $j (sort keys %content){ #output
    print $content{$j};
}
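My solution has this problem: "O J b յX" is not matched because its special-character count is <= 14. If I change the threshold to a smaller number, strings such as 00:01:33,208 are matched as well and get deleted from the content.
Is there a way to inspect a char as UTF-8?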
Answer 0 (score: 2)
Here is a simpler solution:
while($line = <>) {
    print $line unless $line =~ /[^\x00-\x7e]/;
}
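The character class [\x00-\x7e] covers all basic ASCII characters, including control characters, with the exception of DEL (\x7f). The negated class [^\x00-\x7e] therefore matches any character outside that range, so a line containing one is not printed.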
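As a rough illustration of how this class behaves, the sketch below applies the same test to a few in-memory lines instead of reading from <>. The helper name is_ascii_line and the sample strings are made up for this example, and the garbled line is written with \x escapes on the assumption that the input is handled as raw bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when every character of the line is in \x00-\x7e.
sub is_ascii_line {
    my ($line) = @_;
    return $line !~ /[^\x00-\x7e]/ ? 1 : 0;
}

# Sample lines; the second one stands in for a mis-encoded subtitle line.
my @lines = (
    "00:01:32,288 --> 00:01:33,208\n",
    "\xac\x4f\xa5\x4c\xa1\x48\n",
    "How are you?\n",
);

# Keep only the lines that pass the ASCII test and print them.
my @kept = grep { is_ascii_line($_) } @lines;
print @kept;
```

Unlike the counting approach in the question, this keeps the timestamp line, because every character in it is plain ASCII.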
Answer 1 (score: 0)
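#!/usr/bin/perl -w
$lineIndex = 0;
while($line=<>){
    $lineIndex++;
    # always keep subtitle timestamp lines such as 00:01:32,288 --> 00:01:33,208
    if($line =~ /\d\d:\d\d:\d\d,\d\d\d/){
        print $line;
        next;
    }
    $flag = 1; # assume the line is clean until a disallowed char is found
    for($i = 0; $i < length ($line); $i++){
        $char = substr $line,$i,1;
        # anything other than word chars, spaces and basic punctuation marks the line as garbled
        if($char !~ /[\w ,.;!?'\-\r\n]/){
            $flag = 0;
            last;
        }
    }
    if($flag == 1){
        print $line;
    }
}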
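The character-by-character loop above can also be written as a single anchored regex. This is only a sketch under the same assumptions as the answer: timestamp lines are always kept, and every other line must consist solely of allowed characters. The helper name keep_line and the sample lines are hypothetical, and \w is spelled out as A-Za-z0-9_ here so that the result does not depend on whether Perl applies Unicode rules to the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide whether a subtitle line should be printed.
sub keep_line {
    my ($line) = @_;
    # Timestamp lines such as 00:01:32,288 --> 00:01:33,208 are always kept.
    return 1 if $line =~ /\d\d:\d\d:\d\d,\d\d\d/;
    # Otherwise keep the line only if it consists entirely of allowed characters.
    return $line =~ /\A[A-Za-z0-9_ ,.;!?'\-\r\n]*\z/ ? 1 : 0;
}

# Sample subtitle block; the third line stands in for mis-encoded bytes.
my @lines = (
    "3\n",
    "00:01:36,768 --> 00:01:39,648\n",
    "\x80\xd1\xb0\xda\xa1\x41\n",
    "I am fine\n",
    "And you ?\n",
);

my @kept = grep { keep_line($_) } @lines;
print @kept;
```

To use it as a filter, the sample array would be replaced by a loop over <>, as in the answer.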