If I have an input sentence
input = 'ok ok, it is very very very very very hard'
what I want to do is keep only the first three copies of any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this using the re or regex module in Python?
Answer 0: (score: 1)
One option is to use a capture group with a backreference, and use that group in the replacement.
((\w+)(?: \2){2})(?: \2)*
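Explanation
( Open capture group 1
(\w+) Capture group 2, match 1+ word characters (the example data only uses word characters; to make sure they are not part of a larger word, use a word boundary \b)
(?: \2){2} Repeat 2 times matching a space and a backreference to group 2. Instead of the single space you can use [ \t]+ to match 1+ spaces or tabs, or \s+ to match 1+ whitespace characters (note that \s also matches a newline)
) Close group 1
(?: \2)* Match 0+ times a space and a backreference to group 2, which matches the extra copies of the word that you want to remove
For example: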
import re

regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
# Replace each run with group 1, which holds only the first three copies
result = re.sub(regex, r"\1", s)
if result:
    print(result)
Result
ok ok, it is very very very hard
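The explanation above suggests word boundaries and [ \t]+ as refinements. A minimal sketch of how they might be combined, with the number of copies to keep as a parameter, is shown below. The function name keep_first_n and its n parameter are assumptions made for illustration and are not part of the original answer.

```python
import re

def keep_first_n(text, n=3):
    """Keep at most n consecutive copies of a repeated word (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1 holds the word plus n-1 repeats; the trailing part matches the extras.
    # \b keeps the backreference from matching inside a longer word.
    pattern = r"\b((\w+)(?:[ \t]+\2\b){%d})(?:[ \t]+\2\b)*" % (n - 1)
    return re.sub(pattern, r"\1", text)

print(keep_first_n("ok ok, it is very very very very very hard"))
# "every" is a different word from "very", so nothing is removed here
print(keep_first_n("every very very very"))
```

Without the word boundaries, the second call would treat the "very" inside "every" as the start of the run and drop the last "very".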
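Answer 1: (score: 1)
You can capture the word in a group and refer to it with a backreference to make sure it is followed by more than two repetitions of itself (that is, it occurs at least four times in a row):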
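import re
# `input` is the string from the question (the name shadows the built-in input())
input = 'ok ok, it is very very very very very hard'
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', input))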
This outputs:
ok ok, it is very very very hard
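Because \s+ also matches line breaks, the pattern above treats repeats that span a newline as a single run. The sketch below compares it with a variant that uses [ \t]+ instead. The variant and the sample text are assumptions added for illustration, not part of the answer.

```python
import re

# Pattern from the answer: \s+ matches any whitespace, including "\n".
any_whitespace = r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b'
# Hypothetical variant: [ \t]+ matches only spaces and tabs, so a run ends at a line break.
same_line_only = r'\b((\w+)(?:[ \t]+\2){2})(?:[ \t]+\2)+\b'

text = 'very very\nvery very hard'

across_lines = re.sub(any_whitespace, r'\1', text)
within_line = re.sub(same_line_only, r'\1', text)

print(repr(across_lines))  # the fourth "very" is removed
print(repr(within_line))   # unchanged: no line holds four copies in a row
```

Which behaviour is wanted depends on whether a line break should count as a separator between repeated words.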
Answer 2: (score: 0)
One solution with re.sub and a custom function:
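import re

s = 'ok ok, it is very very very very very hard'

def replace(n=3):
    # Generator that receives each matched chunk via send() and yields its replacement
    last_word, cnt = '', 0
    current_word = yield
    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0
            last_word = current_word
        if cnt >= n:
            # More than n identical chunks in a row: drop this one
            current_word = yield ''
        else:
            current_word = yield current_word

replacer = replace()
next(replacer)  # advance to the first yield so that send() can be used
# Chunks are compared together with their surrounding whitespace
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))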
Prints:
ok ok, it is very very very hard
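The same callback idea can be sketched with a closure instead of a generator, comparing only the captured word rather than the word plus its surrounding whitespace. This is an illustrative alternative and not the answerer's code. The names limit_repeats and keep_or_drop are hypothetical, and the choice to let punctuation end a run is an assumption.

```python
import re

def limit_repeats(text, n=3):
    """Keep at most n consecutive copies of a word (hypothetical helper)."""
    last_word = None
    count = 0

    def keep_or_drop(match):
        nonlocal last_word, count
        word = match.group(1)
        if word is None:
            # Punctuation matched: it ends the current run of repeats.
            last_word, count = None, 0
            return match.group(0)
        if word == last_word:
            count += 1
        else:
            last_word, count = word, 1
        # Past the limit, drop the word together with its leading whitespace.
        return match.group(0) if count <= n else ''

    # Match either a word with its leading whitespace, or a run of punctuation.
    return re.sub(r'\s*(\w+)|[^\w\s]+', keep_or_drop, text)

print(limit_repeats('ok ok, it is very very very very very hard'))
```

Since the state lives inside each call to limit_repeats, the function can be reused on several strings without priming or resetting anything.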