Question

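I'm using Python to try to extract data from some legacy code. The content I'm interested in isn't sitting between tidy HTML tags but between strings made up of punctuation and letters. So far I get everything between the first instance of the start string and the last instance of the end boundary string, rather than each piece of content separately. For example: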

>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'

>>> start1 = '"text:"'

>>> end1 = '",body'

>>> print q[q.find(start1)+len(start1):q.rfind(end1)]
content_of_interest_1",body, code code "text:":content_of_interest_2

I want to find every instance of content enclosed by start1 and end1, i.e.:

content_of_interest_1, content_of_interest_2

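How can I rewrite the code to get each instance of the delimited content, rather than everything between the outermost delimiters as shown above?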

Answer 1

For the first substring, you need to call q.find with end1 instead of rfind; for the last substring, use rfind for both boundaries:

>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'

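But find only gives you the index of the first occurrence of start and end (and rfind only the last), so slicing this way does not extend beyond two matches. A more suitable approach for this kind of task is a regular expression: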

>>> import re
>>> re.findall(r':"(.*?)"',q)
['content_of_interest_1', ':content_of_interest_2']
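The pattern above is tied to the `:"` and `"` characters in this particular string. As a minimal sketch, assuming the delimiters are the start1 and end1 values from the question, the same non-greedy idea can be built from any pair of boundary strings. The helper name `between` is hypothetical, chosen only for illustration:

```python
import re

def between(text, start, end):
    """Return each shortest run of text enclosed by start and end (hypothetical helper)."""
    # re.escape makes the delimiters match literally, whatever punctuation they contain.
    # (.*?) is non-greedy, so each match stops at the nearest end delimiter.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, text)

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
matches = between(q, '"text:"', '",body')
print(matches)  # ['content_of_interest_1', ':content_of_interest_2']
```

Note that `.` does not match newlines by default; if the content may span lines, passing `flags=re.DOTALL` to `re.findall` would probably be needed.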

Answer 2

You can use a regular expression with a positive lookbehind:

import re
re.findall(r'(?<="text:"):?\w+', q)
#['content_of_interest_1', ':content_of_interest_2']
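Both results so far keep the stray colon in front of the second item, whereas the question asks for content_of_interest_2 without it. One possible variation, assuming the content consists only of letters, digits, and underscores, is to match the marker and the optional colon outside a capturing group so that findall returns just the captured part:

```python
import re

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'

# The optional colon sits outside the capturing group, so findall returns
# only the word characters that follow the "text:" marker.
pattern = re.compile(r'"text:":?(\w+)')
contents = pattern.findall(q)
print(contents)  # ['content_of_interest_1', 'content_of_interest_2']
```

If the real content can contain spaces or punctuation, `\w+` would cut it short, and a non-greedy group ending at the closing delimiter may be a better fit.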

Searching for content between non-markup strings
