Find strings between two substrings, and between a substring and the end of the file

Asked: 2017-01-20 13:34:28

Tags: python regex

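I have the following problem. I want to extract specific strings from multiple text files, and the text files follow a certain pattern. For example: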

example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"

Every file is very different, but in all of them I want the text between 'Pear' and 'Apple'. I solved that part with the following code:

x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)

This returns:

['this should be included1 ', 'this should be included2 ']

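What I can't figure out is how to also get the last string, the 'this should be included3' part. So I'm wondering whether there is a way to specify something like this with a regex: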

 x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)

So how do I match something between 'Pear' and EOF (end of file)? Note that these are whole text files (so not just one specific sentence).

1 answer:

Answer 0: (score: 4)

Select either Apple or $ (an anchor that matches at the end of the string):

x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)

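| specifies two alternatives, and (?:...) is a non-capturing group, so the parser knows to pick either Apple or $ as the match. Because the group does not capture, re.findall still returns only the text captured by (.*?).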
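To illustrate why the group is non-capturing, here is a minimal sketch comparing both forms. The sample string and the variable names are made up for this illustration and are not from the question:

```python
import re

# Hypothetical sample text, shortened for illustration.
text = "Pear first Apple skip Pear last"

# Non-capturing group: findall returns only the one captured group.
non_capturing = re.findall(r'Pear\s+(.*?)(?:Apple|$)', text, re.DOTALL)

# Capturing group: findall returns one tuple per match, one item per group.
capturing = re.findall(r'Pear\s+(.*?)(Apple|$)', text, re.DOTALL)

print(non_capturing)  # ['first ', 'last']
print(capturing)      # [('first ', 'Apple'), ('last', '')]
```

With a plain (Apple|$) group you would have to pick the first element out of every tuple, so (?:...) keeps the result a flat list of strings.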

Note that I replaced Pear+\s with Pear\s+, because I suspect you want to match arbitrary whitespace, not an arbitrary number of r characters.
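The difference between the two patterns can be seen in a small sketch. The sample string below is invented for illustration only:

```python
import re

# Hypothetical sample: one "Pear" with extra r's, one followed by mixed whitespace.
sample = "Pearrr one Pear  \n two"

# Pear+\s: the + repeats the "r", then exactly one whitespace character follows.
old_hits = re.findall(r'Pear+\s', sample)

# Pear\s+: exactly "Pear", then one or more whitespace characters.
new_hits = re.findall(r'Pear\s+', sample)

print(old_hits)  # ['Pearrr ', 'Pear ']
print(new_hits)  # ['Pear  \n ']
```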

Demo:

>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
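Since the question is about text files rather than a single string, here is a rough sketch of applying the same pattern to a file's contents. The helper name extract_sections, the file name sample.txt, and the file contents are all assumptions made for this example, and it assumes each file fits in memory and is UTF-8 encoded:

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the answer, compiled once for reuse across files.
PATTERN = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_sections(path):
    """Hypothetical helper: return the text after each 'Pear', up to the next 'Apple' or the end."""
    text = Path(path).read_text(encoding="utf-8")
    return PATTERN.findall(text)

# Demo on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("intro Pear keep1 Apple drop\nPear keep2\nmore\n", encoding="utf-8")
    sections = extract_sections(sample)

print(sections)  # ['keep1 ', 'keep2\nmore']
```

Note that without re.MULTILINE, $ matches at the very end of the string and also just before a final trailing newline, which is why the last result above does not include the closing newline. If that newline matters, \Z matches only at the very end of the string.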