Question

This seems like a simple task, but I've sunk enough time into it that I'm finally asking for help:

I have a long text file that looks roughly like this:

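Start of test xyz:

Multiple lines of blah blah blah
Start of test wzy:

Multiple lines of blah blah blah
Start of test qqq:

Multiple lines of blah blah blah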

I want to grab everything after each "Start of test" declaration, and this expression gets me about half of what I need:

re.findall(r'Start of test(.+?)Start of test', curfile, re.S)

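The most obvious problem is that I'm consuming the start of the next block I want to search for, so I only get about half of the results I want. Even assuming I can avoid that, I still can't figure out how to capture the last block, which has no "Start of test" to end the match.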

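I think I need a negative lookahead assertion, but I haven't had much luck working out the right way to use one. I've been trying things like this: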

re.findall(r'Start of test(.+?)(?!Start of test)

which doesn't give useful results.

Answer 1

I think this is the pattern you are looking for:

Start of test(.+?)(?=Start of test|$)

Then your new code should be:

re.findall(r'Start of test(.+?)(?=Start of test|$)', curfile, re.S)
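As a rough illustration of the difference, here is a minimal sketch that runs the question's original pattern and the lookahead pattern against a small made-up sample. The contents of `curfile` here are an assumption for the demo, not the asker's actual file.

```python
import re

# Hypothetical sample data in the format described in the question.
curfile = (
    "Start of test xyz:\n\nline a\nline b\n"
    "Start of test wzy:\n\nline c\n"
    "Start of test qqq:\n\nline d\n"
)

# Original pattern: the trailing literal consumes the next header,
# so that header is no longer available to start the next match.
consuming = re.findall(r'Start of test(.+?)Start of test', curfile, re.S)

# Lookahead pattern: the next header (or the end of the string) is
# checked but not consumed, so every block is found.
lookahead = re.findall(r'Start of test(.+?)(?=Start of test|$)', curfile, re.S)

print(len(consuming))  # 1
print(len(lookahead))  # 3
```

Note that without re.M, `$` also matches just before a trailing newline, so the last captured block may lose its final newline.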


Answer 2

You want a lookahead pattern. See the description of (?=...) at https://docs.python.org/2/library/re.html:

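(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.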
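The documentation's example can be checked directly. This is only a small sketch; the second input string ('Isaac Newton') is made up for contrast.

```python
import re

pattern = re.compile(r'Isaac (?=Asimov)')

# The lookahead succeeds, but 'Asimov' is not part of the match.
m = pattern.search('Isaac Asimov')
print(repr(m.group(0)))  # 'Isaac '
print(m.end())           # 6

# The lookahead fails, so there is no match at all.
print(pattern.search('Isaac Newton'))  # None
```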

So for your case:

re.findall(r'Start of test(.+?)(?=Start of test)', curfile, re.S)

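Note that this has to be tempered with non-greedy evaluation (the .+? above); a greedy .+ would run all the way to the last "Start of test" in the file.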
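To see what the non-greedy quantifier changes, here is a minimal sketch comparing the two forms on made-up data (the `curfile` contents are an assumption). It also shows that, with a lookahead for "Start of test" alone, the final block is not returned, because no header follows it.

```python
import re

# Hypothetical sample data in the format described in the question.
curfile = (
    "Start of test xyz:\n\nline a\nline b\n"
    "Start of test wzy:\n\nline c\n"
    "Start of test qqq:\n\nline d\n"
)

# Non-greedy: each match stops at the nearest following header.
lazy = re.findall(r'Start of test(.+?)(?=Start of test)', curfile, re.S)

# Greedy: the first match runs up to the last header in the text.
greedy = re.findall(r'Start of test(.+)(?=Start of test)', curfile, re.S)

print(len(lazy))    # 2 (the last block has no header after it)
print(len(greedy))  # 1 (one big match spanning the first two blocks)
```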

Answer 3

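It may be more useful to use re.finditer to get an iterable of match objects, and then call mo.start(0) on each match object to find where the current match sits in the original string. You can then recover everything between the matches as follows. Note that my pattern matches only a single "Start of test" line: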

import re

# Matches only a single "Start of test <name>:" header line.
pattern = r'^Start of test (.*):$'
matches = re.finditer(pattern, curfile, re.M)
i = 0  # where the last match ended
names = []
in_between = []
for mo in matches:
    j = mo.start(0)
    in_between.append(curfile[i:j])  # store what came before this match
    i = mo.end(0)  # store the new "end of match" position
    names.append(mo.group(1))  # store the matched name
in_between.append(curfile[i:])  # store the rest of the file

# in_between[0] is what came before the first test
chunks = in_between[1:]
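One possible way to package this approach is a small helper that pairs each test name with the text that follows its header. This is only a sketch: the function name `split_tests` and the sample text are hypothetical, and it assumes test names are unique (duplicates would overwrite each other in the dict).

```python
import re

def split_tests(text):
    """Return {test name: text following its header line}."""
    pattern = re.compile(r'^Start of test (.*):$', re.M)
    names, chunks = [], []
    last_end = None  # end offset of the previous header, if any
    for mo in pattern.finditer(text):
        if last_end is not None:
            # Everything between the previous header and this one.
            chunks.append(text[last_end:mo.start(0)])
        names.append(mo.group(1))
        last_end = mo.end(0)
    if last_end is not None:
        chunks.append(text[last_end:])  # the rest of the file
    return dict(zip(names, chunks))

# Hypothetical sample data in the format described in the question.
sample = (
    "Start of test xyz:\n\nline a\nline b\n"
    "Start of test wzy:\n\nline c\n"
    "Start of test qqq:\n\nline d\n"
)
result = split_tests(sample)
print(list(result))  # ['xyz', 'wzy', 'qqq']
```

Unlike the findall patterns, this keeps the test names separate from the block bodies and also returns the final block.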

Python regex, repeating data
