I need to remove all duplicate lines from a file, while ignoring every occurrence of these characters:
(),、“”。!?#
For example, these two lines would be considered duplicates, so one of them would be removed:
“This is a line。“
This is a line
Similarly, these three lines would be considered duplicates, and only one would remain:
This is another line、 with more words。
“This is another line with more words。”
This is another line! with more words!
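How can I remove all duplicate lines from a file while ignoring certain characters?
Answer 0 (score: 1)
Going by your example, you can strip the symbols first and then remove the duplicates.
For example: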
$ cat foo
«This is a line¡»
This is another line! with more words¡
Similarly, these three lines would be considered duplicates, and only one would remain:
This is a line
This is another line, with more words!
This is another line with more words
$ tr --delete '¡!«»,' < foo | awk '!a[$0]++'
This is a line
This is another line with more words
Similarly these three lines would be considered duplicates and only one would remain:
$
That seems to do the job.
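One caveat: GNU tr works on bytes rather than characters, so multi-byte characters such as 、 or 。 may not be treated as single units. A minimal Python sketch of the same strip-then-dedupe idea, working on characters, might look like the following. The names `strip_then_dedupe` and `IGNORED` are made up for illustration, and the character set is the one passed to tr above.

```python
# Sketch: delete the ignored characters, then keep the first occurrence of
# each resulting line (roughly what `tr --delete ... | awk '!a[$0]++'` does).
IGNORED = "¡!«»,"  # same set as the tr call above; adjust as needed


def strip_then_dedupe(lines, ignored=IGNORED):
    """Return the stripped lines, without duplicates, in first-seen order."""
    table = str.maketrans("", "", ignored)  # maps each ignored char to None
    stripped = (line.translate(table) for line in lines)
    # dict preserves insertion order (Python 3.7+), so first occurrences win
    return list(dict.fromkeys(stripped))


if __name__ == "__main__":
    sample = [
        "«This is a line¡»",
        "This is another line! with more words¡",
        "This is a line",
        "This is another line, with more words!",
        "This is another line with more words",
    ]
    for line in strip_then_dedupe(sample):
        print(line)
```

As with the pipeline, the output contains the stripped text, not the original lines.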
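Edit:
From your question, it seemed that these symbols and punctuation marks were irrelevant. You should be precise about that.
I don't have time to write this out, but I think the simple approach is to parse your file and maintain an array of the lines that have already been printed: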
for each line:
    cleanedLine = stripFromSymbol(line)
    if cleanedLine not in AlreadyPrinted:
        AlreadyPrinted.push(cleanedLine)
        print line
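The pseudocode above could be fleshed out roughly as follows. This is only a sketch: `dedupe_keep_original` and `IGNORED` are hypothetical names, the character set is copied from the question, and the input is assumed to be text in the locale's encoding (for example UTF-8). It compares lines by their stripped form but prints the first original line of each group, punctuation included.

```python
import sys

IGNORED = "(),、“”。!?#"  # characters to ignore, copied from the question


def dedupe_keep_original(lines, ignored=IGNORED):
    """Yield each line whose stripped form has not been seen before."""
    table = str.maketrans("", "", ignored)  # maps each ignored char to None
    already_printed = set()
    for line in lines:
        # Compare without the trailing newline so a final line that lacks
        # one still matches earlier copies.
        cleaned = line.rstrip("\n").translate(table)
        if cleaned not in already_printed:
            already_printed.add(cleaned)
            yield line


if __name__ == "__main__":
    # Reads from standard input and writes the surviving lines unchanged.
    sys.stdout.writelines(dedupe_keep_original(sys.stdin))
```

Saved under a name of your choice (say `dedupe.py`, a hypothetical file name), it could be run as `python3 dedupe.py < foo`. A set is used instead of an array so that each membership check stays cheap on large files.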
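Answer 1 (score: 1)
Here is one way to do it. You collect the lines into arrays keyed on a normalized version of each line. Normalized here means deleting all the characters you don't want and squeezing runs of spaces. The code then picks the shortest version in each group to print/keep. That heuristic (which version to keep) wasn't really specified, so season to taste. The code is a bit terse for production, so you may want to flesh it out for clarity.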
use utf8;
use strictures;
use open qw/ :std :utf8 /;
my %tree;
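while (my $original = <DATA>) {
    chomp $original;
    # Delete the ignored characters and squeeze runs of spaces into one.
    ( my $normalized = $original ) =~ tr/ (),、“”。!?#/ /sd;
    push @{$tree{$normalized}}, $original;
    #print "O:",$original, $/;
    #print "N:",$normalized, $/;
}
# Sort each group so that the shortest original comes first.
@{$_} = sort { length $a <=> length $b } @{$_} for values %tree;
# Print the shortest original of each group (hash order is arbitrary).
print $_->[0], $/ for values %tree;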
__DATA__
“This is a line。“
This is a line
This is a line
This is another line、 with more words。
This is another line with more words
This is another line! with more words!
Yields:
This is another line with more words
This is a line
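For comparison, the same group-then-pick-shortest idea might be sketched in Python as below. This is not a translation of the Perl above, just an illustration under stated assumptions: `normalize`, `shortest_per_group` and `IGNORED` are hypothetical names, the character set is copied from the question, and ties on length are broken by whichever line came first. Unlike the Perl hash, the result here follows first-seen order.

```python
import re
from collections import defaultdict

IGNORED = "(),、“”。!?#"  # characters to ignore, copied from the question


def normalize(line, ignored=IGNORED):
    """Delete the ignored characters and squeeze runs of spaces into one."""
    stripped = line.translate(str.maketrans("", "", ignored))
    return re.sub(r" +", " ", stripped)


def shortest_per_group(lines):
    """Group lines by normalized form and keep the shortest of each group."""
    groups = defaultdict(list)
    for line in lines:
        groups[normalize(line)].append(line)
    # min() returns the first of several equally short lines; dicts keep
    # insertion order, so groups come out in first-seen order.
    return [min(group, key=len) for group in groups.values()]


if __name__ == "__main__":
    data = [
        "“This is a line。“",
        "This is a line",
        "This is a line",
        "This is another line、 with more words。",
        "This is another line with more words",
        "This is another line! with more words!",
    ]
    for line in shortest_per_group(data):
        print(line)
```

Keeping the shortest line is only one possible rule; replacing `min(group, key=len)` with `group[0]` would keep the first occurrence instead.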