Question

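I have been looking for an efficient way to find the substring between two expressions, unless one of the expressions is part of another expression.

For example:

Once upon a time, in a time far far away, dogs ruled the world. The End.

If I search for the substring between time and End, I get:

, in a time far far away, dogs ruled the world. The

or

far far away, dogs ruled the world. The

I want to ignore the time that is part of Once upon a time. I don't know whether there is a Pythonic way to do this without crazy for loops and if/else cases.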

Answer 1

Use a negative lookahead.

This can be done with a regular expression:

>>> import re
>>> s = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'
>>> pattern = r'time((?:(?!time).)*)End'
>>> re.findall(pattern, s)
[' far far away, dogs ruled the world. The ']

With multiple matches:

>>> s = 'a time b End time c time d End time'
>>> re.findall(pattern, s)
[' b ', ' d ']
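
As a minimal sketch, the same negative-lookahead pattern can be wrapped in a small helper that escapes its delimiters. The function name `between` and the use of `re.DOTALL` are assumptions made for illustration, not part of the answer above.

```python
import re

def between(text, start, end):
    """Return each substring found between `start` and `end`
    that does not itself contain `start`."""
    pattern = re.compile(
        re.escape(start)
        # consume one character at a time, but only while the
        # upcoming text does not begin another `start`
        + r"((?:(?!" + re.escape(start) + r").)*)"
        + re.escape(end),
        re.DOTALL,  # assumption: allow a match to span line breaks
    )
    return pattern.findall(text)

s = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'
matches = between(s, 'time', 'End')
many = between('a time b End time c time d End time', 'time', 'End')
```

Note that this approach keeps the last `time` before each `End`; it does not know about the phrase `Once upon a time` specifically, so a text with only that one `time` before `End` would still produce a match.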

Answer 2

Just remove 'Once upon a time' and look at what is left.

my_string = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'
if 'time' in my_string.replace('Once upon a time', ''):
    pass
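
To go from the membership test to the actual substring, one possible sketch is to run a plain search on the string that remains after the removal. The names `remaining` and `found` are illustrative, and the greedy `time(.*)End` pattern is an assumption about what should be extracted.

```python
import re

my_string = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'

# Drop the phrase that should be ignored, then search what remains.
remaining = my_string.replace('Once upon a time', '')
match = re.search(r'time(.*)End', remaining)

# found is None when no 'time ... End' span is left
found = match.group(1) if match else None
```

Because the phrase is deleted before searching, any match positions refer to the shortened string, not to the original one.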

Answer 3

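The typical solution here is to use capturing and non-capturing regular expression groups. Since regex alternations are tried from left to right, put any exceptions to the rule first (without a capture group) and end with the alternative you actually want to select.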

import re

text = "Once upon a time, in a time far far away, dogs ruled the world. The End."
query = re.compile(r"""
  Once\ upon\ a\ time|          # literally 'Once upon a time' (spaces are
                               # escaped because re.X ignores whitespace),
                               # should not be selected
  time\b                       # from the word 'time'
  (.*)                         # capture everything
  \bend                        # until the word 'end'
""", re.X | re.I)

result = query.findall(text)
# result = ['', ' far far away, dogs ruled the world. The ']

You can then remove the empty groups (they are added whenever the unwanted string is matched):

result = list(filter(None, result))
# or result = [r for r in result if r]
# [' far far away, dogs ruled the world. The ']

and then strip whitespace from the results:

result = list(map(str.strip, filter(None, result)))
# or result = [r.strip() for r in result if r]
# ['far far away, dogs ruled the world. The']

This solution is especially useful when you have many phrases you are trying to avoid.

phrases = ["Once upon a time", "No time like the present", "Time to die", "All we have left is time"]
querystring = r"time\b(.*)\bend"
query = re.compile("|".join(map(re.escape, phrases)) + "|" + querystring, re.I)

# some_text is a placeholder for the string being searched
result = [r.strip() for r in query.findall(some_text) if r]
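
The phrase-list variant can be sketched as a self-contained helper, run here against the example sentence from the question. The function name `find_between` and its `start`/`end` parameters are hypothetical and only generalize the hard-coded `time` and `end` used above.

```python
import re

def find_between(text, ignore_phrases, start="time", end="end"):
    """Capture text between the words `start` and `end`, skipping any
    occurrence of `start` that sits inside one of `ignore_phrases`."""
    # Exceptions come first and have no capture group, so findall
    # reports an empty string whenever one of them matches.
    alternatives = [re.escape(p) for p in ignore_phrases]
    alternatives.append(re.escape(start) + r"\b(.*)\b" + re.escape(end))
    query = re.compile("|".join(alternatives), re.I)
    # Drop the empty strings and trim the surrounding whitespace.
    return [r.strip() for r in query.findall(text) if r]

phrases = ["Once upon a time", "No time like the present",
           "Time to die", "All we have left is time"]
text = "Once upon a time, in a time far far away, dogs ruled the world. The End."
result = find_between(text, phrases)
```

The ignored phrase is consumed by the first alternative before the scanner ever reaches the `time` inside it, which is why ordering the exceptions first matters.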

Find a substring in a block of text, unless it is part of another substring

3 answers: