Question

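I have a long list of lines in which many lines share the same first word (the first string before a space) but differ in the rest. I need to keep only one line for each unique first string.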

john jane
john 123
john jim jane
jane john
jane 123
jane 456
jim
jim 1

To get this result:

john jane
jane john
jim

So, whenever the first word matches, delete all of those lines except one.

I can remove all fully duplicated lines, but that leaves lines like those in the example above:

^(.*)(\r?\n\1)+$

This regex removes identical lines, not lines like those in the example. Is there a regex or a Notepad macro that solves this?
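
To illustrate the limitation described above, here is a minimal Python sketch of that same regex. The function name drop_identical_runs and the sample strings are made up for illustration, and the inputs are assumed to use plain "\n" line endings. Because the backreference \1 must be followed by a line break or the end of a line, only whole-line repeats are collapsed.

```python
import re

# The regex from the question: a line followed by one or more exact copies of itself.
IDENTICAL_RUNS = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def drop_identical_runs(text):
    """Collapse each run of identical consecutive lines into a single line."""
    return IDENTICAL_RUNS.sub(r"\1", text)


# Fully identical lines are collapsed ...
print(drop_identical_runs("a b\na b\na c"))
# ... but lines that only share their first word are left alone.
print(drop_identical_runs("john jane\njohn 123\njim\njim 1"))
```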

Answer 1

If you have awk:

awk '!seen[$1]++' infile.txt

Adapted from this thread: Unix: removing duplicate lines without sorting
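
For readers without awk, the one-liner can be read as follows: seen[$1]++ evaluates to how many times the first field has already been seen, the ! turns that into "true only the first time", and awk prints a line by default when the condition is true. The lines do not have to be consecutive. A rough Python equivalent, where keep_first_per_key is a made-up name and the data is the sample from the question:

```python
def keep_first_per_key(lines):
    """Yield only the first line seen for each first whitespace-separated field."""
    seen = set()
    for line in lines:
        fields = line.split()
        key = fields[0] if fields else ""  # like awk, $1 is empty on a blank line
        if key not in seen:
            seen.add(key)
            yield line


sample = [
    "john jane", "john 123", "john jim jane",
    "jane john", "jane 123", "jane 456",
    "jim", "jim 1",
]
result = list(keep_first_per_key(sample))
print("\n".join(result))
# john jane
# jane john
# jim
```

To process a real file, the same generator could be fed an open file object instead of the list.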

Answer 2

With Notepad++ (assuming that lines with the same first word are consecutive):

Search: ^(\S++).*\K(?:\R\1(?:\h.*|$))+
Replace: (leave empty)


Pattern details:

^             # start of the line
(\S++)        # the first "word" (all that isn't a whitespace) captured in group 1
.*            # all characters until the end of the line
\K            # remove characters matched before from the match result
(?:
    \R        # a newline
    \1        # reference to the capture group 1 (same first word)
    (?:
        \h.*  # a horizontal whitespace 
      |       # OR
        $     # the end of the line
    )
)+            # repeat one or more times
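
The same idea can be sketched outside Notepad++. Python's built-in re module has no \K, \R or \h, so the port below is an approximation, not the pattern above: it captures the first line in group 1 and puts it back, assumes "\n" line endings, and treats only space and tab as the separator after the first word. The name collapse_consecutive is made up for illustration.

```python
import re

PATTERN = re.compile(
    r"^((\S+)(?:[ \t].*)?)"      # group 1: the first line; group 2: its first word
    r"(?:\n\2(?:[ \t].*)?$)+",   # one or more following lines starting with that word
    re.MULTILINE,
)


def collapse_consecutive(text):
    """Keep the first line of each run of consecutive lines sharing a first word."""
    return PATTERN.sub(r"\1", text)


sample = (
    "john jane\njohn 123\njohn jim jane\n"
    "jane john\njane 123\njane 456\n"
    "jim\njim 1\n"
)
print(collapse_consecutive(sample), end="")
# john jane
# jane john
# jim
```

As with the Notepad++ pattern, lines that share a first word but are not adjacent are left untouched.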

Answer 3

Perl:

s/^((\w+).*)\n(?:(?:\2.*\n)*)/$1\n/gm

You can try it out like this:

#!/usr/bin/perl

use warnings;
use strict;

my $file = "john jane
john 123
john jim jane
jane john
jane 123
jane 456
jim
jim 1
";

$file =~ s/^((\w+).*)\n(?:(?:\2.*\n)*)/$1\n/gm;

print $file;
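
One caveat worth checking before using this on real data: \2.* only requires the next line to begin with the same characters, so a line starting with "jimmy" would be swallowed after a line starting with "jim". The Python port below (same regex syntax for this pattern; the names ORIGINAL and BOUNDED and the test string are made up) compares the substitution as written with a variant that adds \b after the backreference. The \b is a suggested tweak, not part of the answer above.

```python
import re

# Port of the Perl substitution as written.
ORIGINAL = re.compile(r"^((\w+).*)\n(?:\2.*\n)*", re.MULTILINE)
# Variant with a word boundary after the backreference (an assumption, see above).
BOUNDED = re.compile(r"^((\w+).*)\n(?:\2\b.*\n)*", re.MULTILINE)

text = "jim\njimmy 1\n"

as_written = ORIGINAL.sub(r"\1\n", text)
with_boundary = BOUNDED.sub(r"\1\n", text)

print(repr(as_written))     # 'jim\n'           (the "jimmy 1" line is lost)
print(repr(with_boundary))  # 'jim\njimmy 1\n'  (both lines are kept)
```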

Regex to remove lines that match on the first string?
