Question

I have a log file that I use for analysis. A few of its lines repeat part of themselves, though not the whole line. For example:

Alex is here and Alex is here and we went out
We bothWe both went out

I want to remove the first occurrence and get:

Alex is here and we went out
We both went out

Please share a regular expression that works in Vim on Windows.

Answer 1

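I don't recommend trying to solve this with regular expression magic. Just write an external filter and use it.

Here is an external filter written in Python. You can use it to pre-process the log file, like so: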

python prefix_chop.py logfile.txt > chopped.txt

But it also works on standard input:

cat logfile.txt | prefix_chop.py > chopped.txt

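This means you can use it from within vim with the ! command. Try these commands: go to line 1, then filter everything from the current line through the last line using the external program prefix_chop.py: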

1G
!Gprefix_chop.py<Enter>

Or you can do it from ex mode:

:1,$!prefix_chop.py<Enter>

Here is the program:

#!/usr/bin/python

import sys
infile = sys.stdin if len(sys.argv) < 2 else open(sys.argv[1])

def repeated_prefix_chop(line):
    """
    Check line for a repeated prefix string.  If one is found,
    return the line with that string removed, else return the
    line unchanged.
    """
    # Repeated string cannot be more than half of the line.
    # So, start looking at mid-point of the line.
    i = len(line) // 2 + 1

    while True:
        # Look for longest prefix that is found in the string after pos 0.
        # The prefix starts at pos 0 and always matches itself, of course.
        pos = line.rfind(line[:i])
        if pos > 0:
            return line[pos:]
        i -= 1

        # Stop testing before we hit a length-1 prefix, in case a line
        # happens to start with a word like "oops" or a number like "77".
        if i < 2:
            return line

for line in infile:
    sys.stdout.write(repeated_prefix_chop(line))

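I put a #! comment on the first line, so this will run as a stand-alone program on Linux, on Mac OS X, or on Windows if you are using Cygwin. If you are using plain Windows without Cygwin, you may need to create a batch file to run it, or just type the whole command python prefix_chop.py. If you create a macro to run it, you don't have to type it yourself.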

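EDIT: This program is pretty simple. Perhaps it could be done in "vimscript" and run purely inside vim. But an external filter program can also be used outside of vim: you could set things up so that the log file is run through the filter once a day, every day, if you like.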
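
To check the filter against the sample lines from the question without touching a real log file, something like the following can be used. This is only a sketch: the function body is a condensed restatement of repeated_prefix_chop from the program above (same search order, longest prefix first), and the extra test lines ("oops...", "77 errors") are made up for illustration.

```python
def repeated_prefix_chop(line):
    """Condensed restatement of the filter function from the program above."""
    # Try prefixes from just over half the line down to length 2.
    # Length-1 prefixes are skipped so "oops" or "77" are left alone.
    for i in range(len(line) // 2 + 1, 1, -1):
        pos = line.rfind(line[:i])
        if pos > 0:
            return line[pos:]
    return line


# Sample lines from the question, plus two made-up lines that must not change.
samples = [
    "Alex is here and Alex is here and we went out\n",
    "We bothWe both went out\n",
    "oops, nothing repeated\n",
    "77 errors\n",
]

for s in samples:
    print(repr(s), "->", repr(repeated_prefix_chop(s)))
```

The first two lines come back with the repeated prefix removed, and the last two come back unchanged.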

Answer 2

Regex: \b(.*)\1\b

Replace with: \1 or $1

If you want to handle more than two repetitions of the same text, you can try this:

\b(.+?\b)\1+\b
      --
       |->avoids matching individual characters in word like xxx

Note

In Vim, use \< and \> instead of \b
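
To see how the two patterns behave, here is a small sketch that tries them out with Python's re module. That is an assumption on my part: Python's engine stands in for Vim's here, so \b is kept as written and the results may not carry over exactly to Vim syntax. The "ha ha ha funny" line is a made-up example of a triple repetition.

```python
import re

# The two patterns from this answer, in Python syntax.
SIMPLE = re.compile(r'\b(.*)\1\b')      # one repetition
MULTI = re.compile(r'\b(.+?\b)\1+\b')   # two or more repetitions


def collapse(pattern, line):
    """Replace each repeated run with a single copy of the captured text."""
    return pattern.sub(r'\1', line)


samples = [
    "Alex is here and Alex is here and we went out",
    "We bothWe both went out",
    "ha ha ha funny",  # hypothetical line with a triple repetition
]

for s in samples:
    print("input :", s)
    print("SIMPLE:", collapse(SIMPLE, s))
    print("MULTI :", collapse(MULTI, s))
```

One thing this shows: the second pattern needs a word boundary at the end of the captured text, so it leaves "We bothWe both went out" alone (there is no boundary between "both" and "We"), while the first pattern fixes that line. On the other hand, only the second pattern reduces "ha ha ha funny" all the way to "ha funny".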

Answer 3

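You can do this by matching as much as possible at the start of the line, and then using a backreference to match the repeated bit.

For example, this command solves the problem you describe: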

:%s/^\(.*\)\(\1.*\)/\2
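
If you want to try the idea outside of Vim first, the same pattern can be written for Python's re module. This is a sketch under the assumption that Python's backtracking behaves like Vim's for this pattern; drop_leading_repeat is a name I made up, and the last two sample lines are invented for illustration.

```python
import re

# Python rendering of the Vim pattern ^\(.*\)\(\1.*\)
# Group 1 is the longest prefix that is immediately followed by a copy of
# itself; group 2 is that copy plus the rest of the line.
PATTERN = re.compile(r'^(.*)(\1.*)')


def drop_leading_repeat(line):
    """Keep only group 2, i.e. drop the first copy of the repeated prefix."""
    return PATTERN.sub(r'\2', line, count=1)


samples = [
    "Alex is here and Alex is here and we went out",
    "We bothWe both went out",
    "No repeats here",  # hypothetical line without repetition
    "77 errors",        # hypothetical line starting with a doubled character
]

for s in samples:
    print(s, "->", drop_leading_repeat(s))
```

Lines without a repeated prefix pass through unchanged, because group 1 can match the empty string. Note the last sample, though: a line that starts with a doubled character loses one of them ("77 errors" becomes "7 errors"), so it is worth checking your log for lines like that before running the substitution.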

Vim: how to remove a repetition within a line
