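What Python regex will capture overlapping two-word sequences (including contractions) in text?

Time: 2018-04-06 22:48:46

Tags: python regex python-3.x

What adjustments does the pattern below need in order to produce the desired output?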

from re import findall

s = '''one can't two won't three'''

pat = r'(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))'

s2 = findall(pat, s)
print(s2)

# actual output
# ["one can't", "can't two", 't two', "two won't", "won't three", 't three']

# desired output
# ["one can't", "can't two", "two won't", "won't three"]

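2 answers:

Answer 0: (score: 1)

The problem is that the word boundary \b also matches right after the apostrophe (between ' and t), so a match can start in the middle of a contraction. A simple fix is to add a negative lookbehind asserting that the match position is not preceded by an apostrophe.

Lookbehind: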
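
To see where the stray matches come from, it can help to list every offset at which \b matches in a short sample. This is only a minimal sketch; the sample string and variable names are illustrative and not taken from the question.

```python
import re

sample = "can't two"

# Offsets at which the zero-width word boundary \b matches.
boundaries = [m.start() for m in re.finditer(r'\b', sample)]
print(boundaries)  # [0, 3, 4, 5, 6, 9]

# Offset 4 lies between the apostrophe and "t", so the question's
# pattern can begin a second match there and capture "t two".
print(repr(sample[4:]))  # 't two'
```

Offsets 3 and 4 sit on either side of the apostrophe, which is why a pattern that starts with \b\w+ can begin again at the trailing "t".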

(?<!\')

Full regex:

(?<!\')(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))
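
Dropped into the question's script, the pattern above should give the desired list. The second pattern in this sketch is an assumption of mine rather than part of the answer: it widens the lookbehind to the same character class used inside the words, so that a match also cannot start after a hyphen or a typographic apostrophe. The names pat_wide, pairs and pairs_wide are illustrative.

```python
from re import findall

s = '''one can't two won't three'''

# Pattern from the answer: the negative lookbehind (?<!\') rejects any
# start position that directly follows an ASCII apostrophe.
pat = r'(?<!\')(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))'
pairs = findall(pat, s)
print(pairs)
# ["one can't", "can't two", "two won't", "won't three"]

# Assumption (not in the answer): also reject start positions that
# follow a hyphen, a typographic apostrophe, or a word character.
pat_wide = r'(?<![\w\'\-’])(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))'
pairs_wide = findall(pat_wide, 'well-known fact isn’t new')
print(pairs_wide)
# ['well-known fact', 'fact isn’t', 'isn’t new']
```

Without the wider lookbehind, the second sample would also yield "known fact" and "t new", for the same reason the question's pattern yields "t two".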

See it in action on regex101.

Answer 1: (score: 1)

How about this?

(?:^|\s+)(?=(\S+\s+\S+))
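
A quick check of this pattern against the question's string, as a sketch. The second sample string is my own and is there only to show a side effect: because \S+ accepts any run of non-whitespace characters, punctuation stays attached to the neighbouring word.

```python
from re import findall

s = '''one can't two won't three'''

# Pattern from the answer: consume the start of the string or a run of
# whitespace, then capture the next two whitespace-separated tokens
# inside a lookahead so that consecutive pairs can overlap.
pat = r'(?:^|\s+)(?=(\S+\s+\S+))'
print(findall(pat, s))
# ["one can't", "can't two", "two won't", "won't three"]

# \S+ takes any non-whitespace run, so punctuation is kept.
print(findall(pat, 'Hello, big world.'))
# ['Hello, big', 'big world.']
```

If the punctuation is unwanted, the \w-based pattern from the other answer may be the better fit. This one is simpler and treats contractions and hyphenated words as single tokens without a special character class.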

Demo