Question

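I have the following problem. I want to extract specific strings from several text files, and the text in those files follows a certain pattern. For example: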

example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"

The files differ a lot from one another, but in every one of them I want the text between 'Pear' and 'Apple'. I solved this with the following code:

x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)

which returns:

['this should be included1 ', 'this should be included2 ']

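What I cannot figure out is how to also get the last string, the 'this should be included3' part. So I was wondering whether there is a way to express this in the regular expression, something like: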

 x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)

So how do I match something between 'Pear' and EOF (end of file)? Note that these are whole text files (so not one specific sentence).

Answer 1

Match either Apple or $ (the anchor that matches at the end of the string):

x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)

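The | separates two alternatives, and (?:...) is a non-capturing group, so the parser knows to pick either Apple or $ as the end of the match without adding a second captured value to the results.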

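Note that I replaced Pear+\s with Pear\s+, since I suspect you wanted to match arbitrary whitespace after Pear, not an arbitrary number of r characters.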
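To see the difference between the two spellings, here is a small sketch. The input string "Pearrr x" is made up purely for illustration and does not come from the question.

```python
import re

text = "Pearrr x"  # made-up input, only for illustration

# 'Pear+\s' reads as: P, e, a, one or more 'r', then exactly one whitespace character.
loose = re.findall(r'Pear+\s', text)

# 'Pear\s+' reads as: the literal word Pear, then one or more whitespace characters.
strict = re.findall(r'Pear\s+', text)

print(loose)   # ['Pearrr ']
print(strict)  # []
```

The first pattern happily accepts the misspelled "Pearrr", while the second only accepts "Pear" directly followed by whitespace.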

Demo:

>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
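Since the question is about whole text files rather than a single line, the same pattern can be wrapped in a small helper. This is only a sketch: the names PEAR_BLOCK and extract_pear_blocks, the sample text, and the choice to strip surrounding whitespace are my own assumptions, not part of the question.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused for many files.
PEAR_BLOCK = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_pear_blocks(text):
    """Return each chunk after 'Pear' up to the next 'Apple' or the end of the text."""
    # Stripping whitespace is an assumption; drop .strip() to keep the raw chunks.
    return [chunk.strip() for chunk in PEAR_BLOCK.findall(text)]

# Made-up multi-line sample that ends with a newline, as text files usually do.
sample = "intro Pear keep1 Apple skip Pear keep2\nmore Apple skip Pear keep3\n"
blocks = extract_pear_blocks(sample)
print(blocks)  # ['keep1', 'keep2\nmore', 'keep3']
```

For a real file you would pass in its whole contents, for example the result of pathlib.Path(...).read_text(). Because re.MULTILINE is not set, $ only matches at the very end of the string (or just before a final trailing newline), so a line break in the middle of a block does not end the match early.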
