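I have the following problem. I want to extract specific strings from multiple text files, and the text in those files follows a certain pattern. For example: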
example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
The files all differ a lot, but in every one of them I want the text between 'Pear' and 'Apple'. I solved that with the following code:
x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)
This returns:
['this should be included1 ', 'this should be included2 ']
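What I cannot figure out is how to also get the last string, the 'this should be included3' part. So I was wondering whether there is a way to specify something like this with a regular expression: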
x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)
So how do I match something between "Pear" and EOF (end of file)? Note that these are whole text files (so not just one specific sentence).
Answer 0 (score: 4)
Match either Apple or $ (an anchor that matches at the end of the string):
x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
| separates two alternatives, and (?:...) is a non-capturing group, so the parser knows to pick either Apple or $ as the match.
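To see why the group around the alternation is non-capturing, here is a minimal sketch comparing the two forms. The sample string `text` is made up for illustration. With a second capturing group, `re.findall` returns one tuple per match instead of a plain string.

```python
import re

# Hypothetical sample text, not taken from the question.
text = "Pear one Apple skip Pear two"

# Non-capturing group: findall returns only the single captured group.
non_capturing = re.findall(r'Pear\s+(.*?)(?:Apple|$)', text, re.DOTALL)

# Capturing group: findall returns one tuple per match, one item per group.
capturing = re.findall(r'Pear\s+(.*?)(Apple|$)', text, re.DOTALL)

print(non_capturing)  # ['one ', 'two']
print(capturing)      # [('one ', 'Apple'), ('two', '')]
```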
Note that I replaced Pear+\s with Pear\s+, because I suspect you want to match arbitrary whitespace, not an arbitrary number of r characters.
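The difference between the two patterns can be checked directly. This is a small sketch using `re.fullmatch` on made-up inputs; the variable names are only for illustration.

```python
import re

# In Pear+\s the + applies to the preceding "r", and \s matches exactly
# one whitespace character.
repeated_r = bool(re.fullmatch(r'Pear+\s', 'Pearrr '))      # True
two_spaces = bool(re.fullmatch(r'Pear+\s', 'Pear  '))       # False: second space is left over

# In Pear\s+ the + applies to \s, so any run of whitespace is accepted
# after a literal "Pear".
mixed_space = bool(re.fullmatch(r'Pear\s+', 'Pear \n\t'))   # True
extra_r = bool(re.fullmatch(r'Pear\s+', 'Pearrr '))         # False: "rr" is not whitespace

print(repeated_r, two_spaces, mixed_space, extra_r)
```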
Demo:
>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
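Since the question is about whole text files rather than a single string, the same pattern can be applied to file contents. The following is only a sketch: the helper name `extract_sections`, the sample file, and the UTF-8 encoding are assumptions, not part of the original question. Note that without re.MULTILINE, `$` also matches just before a newline at the very end of the string, so a trailing newline in the file is not included in the last capture.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the answer, compiled once for reuse across files.
PATTERN = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_sections(path):
    """Hypothetical helper: return every span between 'Pear' and 'Apple'
    (or the end of the file) found in one text file."""
    text = Path(path).read_text(encoding='utf-8')  # encoding is an assumption
    return PATTERN.findall(text)

# Write a small made-up sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_text('intro Pear first\nline Apple skip Pear last\n',
                      encoding='utf-8')
    results = extract_sections(sample)

print(results)  # ['first\nline ', 'last']
```

If the trailing newline should be kept in the last capture, `\Z` (which matches only at the very end of the string) can be used in place of `$`.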