Question

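I want to find all the strings between two substrings, keeping the first substring and discarding the second. However, each substring can be one of several values. For example, if these are the possible substrings: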

subs = ['MIKE','WILL','TOM','DAVID']

I want to get the strings between any of them, like this:

Input:

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'

Output:

[('MIKE', 'an entry for mike'),
 ('WILL', 'and here is wills text'),
 ('DAVID', 'and this belongs to david')]

Trailing whitespace doesn't matter. I tried:

re.findall('(MIKE|WILL|TOM|DAVID)(.*?)(MIKE|WILL|TOM|DAVID)',text)

but it only returns the first match and keeps the ending substring. I'm not sure what the best approach is.

Answer 1

You can use

import re
text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
subs = ['MIKE','WILL','TOM','DAVID']
res = re.findall(r'({0})\s*(.*?)(?=\s*(?:{0}|$))'.format("|".join(subs)), text)
print(res)
# => [('MIKE', 'an entry for mike'), ('WILL', 'and here is wills text'), ('DAVID', 'and this belongs to david')]

See the Python demo.
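If the list of substrings comes from somewhere you don't control, the same pattern can be wrapped in a small helper. This is only a sketch: the name extract_sections and its flags parameter are made up for illustration and are not part of the answer's code. It assumes the markers should be matched literally, so each one is passed through re.escape, and longer markers are tried first in case one marker is a prefix of another.

```python
import re

def extract_sections(text, markers, flags=0):
    """Return (marker, following text) pairs. Hypothetical helper name."""
    # Try longer markers first so 'TOMMY' would win over 'TOM' (assumption:
    # that is the behaviour you want when markers overlap).
    ordered = sorted(markers, key=len, reverse=True)
    # Escape each marker so characters such as '.' or '+' match literally.
    alternation = "|".join(re.escape(m) for m in ordered)
    pattern = r'({0})\s*(.*?)(?=\s*(?:{0}|$))'.format(alternation)
    return re.findall(pattern, text, flags)

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
subs = ['MIKE', 'WILL', 'TOM', 'DAVID']

pairs = extract_sections(text, subs)
print(pairs)
# Turning the pairs into a dict only makes sense if each marker occurs once.
print(dict(pairs))
```

Passing re.S as the third argument lets a section run across line breaks, for example extract_sections(multiline_text, subs, re.S).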

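Details

(MIKE|WILL|TOM|DAVID) - Group 1, which matches one of the alternative substrings
\s* - 0+ whitespace characters
(.*?) - Group 2, which captures any 0+ characters other than line breaks (pass the re.S flag to match any character), as few as possible, up to the first...
(?=\s*(?:MIKE|WILL|TOM|DAVID|$)) - 0+ whitespace characters followed by one of the substrings or the end of the string ($). This text is not consumed, so the regex engine can still find the subsequent matches.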
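The difference between consuming the closing substring and only looking ahead at it can be seen by running the pattern from the question next to the lookahead version. The variable names below are chosen for illustration only.

```python
import re

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
alts = 'MIKE|WILL|TOM|DAVID'

# The closing marker is part of the match, so it is consumed and cannot
# start the next match.
consuming = re.findall(r'({0})(.*?)({0})'.format(alts), text)

# The closing marker is only asserted by the lookahead, so the next match
# can begin at it.
lookahead = re.findall(r'({0})\s*(.*?)(?=\s*(?:{0}|$))'.format(alts), text)

print(consuming)
# [('MIKE', ' an entry for mike ', 'WILL')]
print(lookahead)
# [('MIKE', 'an entry for mike'), ('WILL', 'and here is wills text'), ('DAVID', 'and this belongs to david')]
```

In the consuming version WILL is used up as the end of the first match, and the remaining DAVID has no closing substring after it, so only one tuple comes back.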

Answer 2

You can also use the following regex to achieve your goal:

(MIKE.*)(?= WILL)|(WILL.*)(?= DAVID)|(DAVID.*)

It uses a positive lookahead to get the text in between. (http://www.rexegg.com/regex-quickstart.html)

Test: https://regex101.com/r/ZSJJVG/1
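A rough sketch of how this pattern might be used from Python follows. The function name sections is hypothetical. Because each group here includes the name itself, the sketch takes the whole match and splits the name off at the first space. Note that the pattern hard-codes the order MIKE, WILL, DAVID, so it only works when the names appear in exactly that sequence.

```python
import re

pattern = re.compile(r'(MIKE.*)(?= WILL)|(WILL.*)(?= DAVID)|(DAVID.*)')

def sections(text):
    """Return (name, text) pairs using the hard-coded pattern above."""
    result = []
    for m in pattern.finditer(text):
        # group(0) is whichever alternative matched, name included;
        # split the name off at the first space.
        name, _, body = m.group(0).partition(' ')
        result.append((name, body))
    return result

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
print(sections(text))
# [('MIKE', 'an entry for mike'), ('WILL', 'and here is wills text'), ('DAVID', 'and this belongs to david')]

# With WILL missing, the MIKE alternative never finds its ' WILL' lookahead,
# so the MIKE section is silently skipped.
print(sections('MIKE first DAVID second'))
# [('DAVID', 'second')]
```

If the names can appear in any order or some can be absent, the generic alternation with a single lookahead is likely the safer choice.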

Python regex to find all strings between two substrings