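How can I remove duplicate lines while ignoring specific characters?

Asked: 2014-02-21 11:30:10

Tags: perl bash duplicate-removal

I need to remove all duplicate lines from a file, but ignoring all occurrences of these characters: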

(),、“”。!?#

For example, these two lines would be considered duplicates, so one of them would be removed:

“This is a line。“
This is a line

Similarly, these three lines would be considered duplicates, and only one would remain:

This is another line、 with more words。
“This is another line with more words。”
This is another line! with more words!
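  • It does not matter which of the duplicate lines is kept in the document.
  • The order of the lines must not change after the duplicates are removed.
  • Almost all lines contain meaningful punctuation, but the punctuation may differ between duplicates. Whichever line is kept will probably still contain punctuation, so punctuation must not be stripped from the final output.

How can I remove all duplicate lines in a file while ignoring certain characters?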

2 answers:

Answer 0: (score: 1)

Going by your example, you could strip the symbols and then remove the duplicates.

For example:

$ cat foo
«This is a line¡»
This is another line! with more words¡

Similarly, these three lines would be considered duplicates, and only one would remain:
This is a line

This is another line, with more words!
This is another line with more words

$ tr --delete '¡!«»,' < foo | awk '!a[$0]++'
This is a line
This is another line with more words

Similarly these three lines would be considered duplicates and only one would remain:

$

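That seems to do the job.

Edit:

From your question, it seemed that these symbols and punctuation marks did not matter. You should have been more precise.

I don't have time to write it out, but I think the simple way is to parse your file and maintain an array of the lines already printed: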

for each line:
  cleanedLine = stripFromSymbol(line)
  if cleanedLine not in AlreadyPrinted:
    AlreadyPrinted.push(cleanedLine)
    print line
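
The pseudocode above could be fleshed out roughly as follows. This is only a sketch in Python, not the answerer's own code: the names `IGNORED`, `STRIP_TABLE` and `dedupe_ignoring` are made up for illustration, and the character list is copied from the question. It compares lines by their stripped form but keeps the original line, so the punctuation survives in the output and the line order is unchanged.

```python
# Characters to ignore when comparing lines (copied from the question).
IGNORED = '(),、“”。!?#'
# Translation table that deletes every ignored character.
STRIP_TABLE = str.maketrans('', '', IGNORED)


def dedupe_ignoring(lines):
    """Yield each line whose stripped form has not been seen before."""
    seen = set()
    for line in lines:
        key = line.translate(STRIP_TABLE)  # comparison key only
        if key not in seen:
            seen.add(key)
            yield line  # emit the original line, punctuation intact


sample = [
    '“This is a line。“',
    'This is a line',
    'This is another line、 with more words。',
    '“This is another line with more words。”',
    'This is another line! with more words!',
]
for kept in dedupe_ignoring(sample):
    print(kept)
```

To run it over a real file, the `sample` list could be replaced by the lines of a file opened with `encoding="utf-8"`, with the trailing newline removed from each line. A set is used instead of an array for `seen`, so each lookup stays cheap on large files.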

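Answer 1: (score: 1)

Here is one way to do it. You collect the lines into arrays keyed on their normalized version. Normalized here means removing all the characters you don't want and squeezing runs of spaces. The code then picks the shortest version in each group to print/keep. That heuristic (which one to keep) was not really specified, so adjust it to taste. The code is a bit terse for production use, so you may want to flesh it out for clarity.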

use utf8;
use strictures;
use open qw/ :std :utf8 /;

my %tree;
while (my $original = <DATA>) {
    chomp $original;
    ( my $normalized = $original ) =~ tr/ (),、“”。!?#/ /sd;
    push @{$tree{$normalized}}, $original;
    #print "O:",$original, $/;
    #print "N:",$normalized, $/;
}

@{$_} = sort { length $a <=> length $b } @{$_} for values %tree;

print $_->[0], $/ for values %tree;

__DATA__
“This is a line。“
This is a line
This  is   a line
This is another line、 with more words。
This is another line with more words
This is another line! with more words!

Yields:

This is another line with more words
This is a line
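
Note that the Perl code prints the values of a hash, so its output order is not guaranteed to match the input order, whereas the question asks for the order to be preserved. One way to keep the same "shortest variant wins" heuristic while preserving first-seen order might look like the sketch below. It is written in Python rather than Perl, and the names `normalize` and `dedupe_keep_shortest` are hypothetical; it relies on Python dicts keeping insertion order (Python 3.7+).

```python
import re

# Characters to ignore when comparing lines (copied from the question).
IGNORED = '(),、“”。!?#'
STRIP_TABLE = str.maketrans('', '', IGNORED)


def normalize(line):
    """Drop ignored characters, then collapse runs of spaces into one."""
    return re.sub(r' +', ' ', line.translate(STRIP_TABLE))


def dedupe_keep_shortest(lines):
    """Return one line per normalized key, in first-seen order."""
    shortest = {}  # key -> shortest original line seen so far
    for line in lines:
        key = normalize(line)
        # Overwriting an existing key keeps its original position.
        if key not in shortest or len(line) < len(shortest[key]):
            shortest[key] = line
    return list(shortest.values())


data = [
    '“This is a line。“',
    'This is a line',
    'This  is   a line',
    'This is another line、 with more words。',
    'This is another line with more words',
    'This is another line! with more words!',
]
for kept in dedupe_keep_shortest(data):
    print(kept)
```

With the same sample data as the Perl script, this keeps the same two lines, but "This is a line" comes first because its group appears first in the input.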