How to use a regular expression to keep only the first n copies of a repeated word

Time: 2019-07-24 22:26:19

Tags: python regex

If I have an input sentence

input = 'ok ok, it is very very very very very hard'

What I want to do is keep only the first three copies of any repeated word:

output = 'ok ok, it is very very very hard'

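How can I achieve this with the re (regex) module in Python?

3 answers:

Answer 0 (score: 1)

One option is to use a capturing group with a backreference, and use that group in the replacement.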

((\w+)(?: \2){2})(?: \2)*

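Explanation

  • ( Capture group 1
    • (\w+) Capture group 2, match 1 or more word characters. (The example data only uses word characters; to make sure they are not part of a larger word, use a word boundary \b.)
    • (?: \2){2} Repeat 2 times matching a space and a backreference to group 2. Instead of a single space, you could use [ \t]+ to match 1 or more spaces or tabs, or \s+ to match 1 or more whitespace characters. (Note that \s+ also matches a newline.)
  • ) Close group 1
  • (?: \2)* Match 0 or more times a space and a backreference to group 2, to match the same words that you want to remove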


For example

import re

regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)

if result:
    print(result)

Result

ok ok, it is very very very hard
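The question asks about the first n copies, and the explanation above suggests word boundaries and [ \t]+ as refinements. A minimal sketch combining both, assuming a hypothetical helper named keep_first_n (not part of the original answer): without a boundary after the backreference, the answer's pattern would also strip the leading "the" out of "theme" in "the the the theme".

```python
import re


def keep_first_n(text, n=3):
    """Keep at most n consecutive copies of a repeated word (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1: the word (group 2) plus n - 1 further copies, which is what we keep.
    # The trailing group matches one or more extra copies, which are dropped.
    # \b after each backreference stops "the" from matching inside "theme".
    pattern = r"\b((\w+)(?:[ \t]+\2\b){%d})(?:[ \t]+\2\b)+" % (n - 1)
    return re.sub(pattern, r"\1", text)


print(keep_first_n("ok ok, it is very very very very very hard"))
# ok ok, it is very very very hard
print(keep_first_n("the the the theme"))
# the the the theme
```

The comparison is case-sensitive, so "Very very" counts as two different words in this sketch.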

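Answer 1 (score: 1)

You can capture a word in a group and use a backreference to it, so that the pattern only matches a word that occurs more than three times in a row: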

import re

s = 'ok ok, it is very very very very very hard'  # the input from the question
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', s))

This outputs:

ok ok, it is very very very hard
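Because \s+ also matches newlines, this pattern treats a word repeated across a line break as part of the same run. A small sketch of the difference, using a made-up two-line input (an assumption for illustration, not from the question) and the [ \t]+ variant for comparison:

```python
import re

# Hypothetical input: the run of "very" is split across two lines.
text = "very very\nvery very very hard"

# \s+ lets the run continue across the newline, so five copies are seen.
any_whitespace = re.sub(r"\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b", r"\1", text)

# [ \t]+ only joins copies on the same line: two copies, then three copies.
same_line_only = re.sub(r"\b((\w+)(?:[ \t]+\2){2})(?:[ \t]+\2)+\b", r"\1", text)

print(repr(any_whitespace))   # 'very very\nvery hard'
print(repr(same_line_only))   # 'very very\nvery very very hard'
```

Which behaviour is wanted depends on whether line breaks should separate runs in the input.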

Answer 2 (score: 0)

One solution with re.sub and a custom function:

s = 'ok ok, it is very very very very very hard'

def replace(n=3):
    last_word, cnt = '', 0
    current_word = yield

    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0

        last_word = current_word

        if cnt >= n:
            current_word = yield ''
        else:
            current_word = yield current_word

import re

replacer = replace()
next(replacer)
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))

Prints:

ok ok, it is very very very hard
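The generator above compares whole matches, including their surrounding whitespace, so a repeated word at the very end of the string (with no trailing space) is treated as a different word and kept. A possible variant, sketched with a closure instead of a generator and with hypothetical names make_replacer and keep_first_n_callback, compares only the word itself:

```python
import re


def make_replacer(n=3):
    """Build a re.sub callback that keeps the first n copies of a run (hypothetical)."""
    last_word, count = None, 0

    def replace(match):
        nonlocal last_word, count
        word = match.group(2)  # the word only, without the leading whitespace
        if word == last_word:
            count += 1
        else:
            last_word, count = word, 1
        # Keep the match (whitespace and word) for the first n copies, drop the rest.
        return match.group(0) if count <= n else ""

    return replace


def keep_first_n_callback(text, n=3):
    # A fresh replacer per call, so counts never leak between inputs.
    return re.sub(r"(\s*)(\w+)", make_replacer(n), text)


print(keep_first_n_callback("ok ok, it is very very very very very hard"))
# ok ok, it is very very very hard
print(keep_first_n_callback("very very very very"))
# very very very
```

As in the original answer, punctuation between words does not reset the count in this sketch, so "ok, ok, ok, ok" is counted as one run.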