Question

What adjustments to the pattern are needed to get the desired output?

from re import findall

s= '''one can't two won't three'''

pat = r'(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))'

s2 = findall(pat, s)
print(s2)

# actual output
# ["one can't", "can't two", 't two', "two won't", "won't three", 't three']

# desired output
# ["one can't", "can't two", "two won't", "won't three"]

Answer 1

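The problem is that the word boundary \b also matches right after the apostrophe, so a scan can start at the "t" of "can't". A simple fix is to add a negative lookbehind asserting that the start position is not preceded by an apostrophe.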

Lookbehind:

(?<!\')

Full regex:

(?<!\')(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))

See it in action on regex101.
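As a minimal sketch, here is that regex dropped into the question's code. The variable name pairs is only illustrative, and the lookbehind checks the straight apostrophe only, exactly as written in the answer.

```python
from re import findall

s = '''one can't two won't three'''

# (?<!\') rejects any start position directly preceded by an apostrophe,
# so a match can no longer begin at the "t" of "can't" or "won't".
pat = r'(?<!\')(?=(\b\w+[\w\'\-’]*\b \b\w+[\w\'\-’]*\b))'

pairs = findall(pat, s)
print(pairs)
# ["one can't", "can't two", "two won't", "won't three"]
```

If the text may contain the typographic apostrophe ’ (which the character class already allows), the lookbehind would presumably need to become (?<![\'’]) to cover it as well.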

Answer 2

How about this?

(?:^|\s+)(?=(\S+\s+\S+))

Demo
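A quick sketch of this pattern run against the question's string, assuming words are separated only by whitespace (the variable name pairs is illustrative):

```python
from re import findall

s = '''one can't two won't three'''

# (?:^|\s+) starts each attempt at the beginning of the string or just
# after a run of whitespace; the lookahead then captures two
# whitespace-separated tokens without consuming them, so pairs can overlap.
pat = r'(?:^|\s+)(?=(\S+\s+\S+))'

pairs = findall(pat, s)
print(pairs)
# ["one can't", "can't two", "two won't", "won't three"]
```

Because \S+ accepts any non-whitespace character, punctuation attached to a word (a trailing comma, for example) stays in the captured pair, which may or may not be what you want.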

What Python regex will capture overlapping two-word sequences (including contractions) in a text?
