Question

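I'm looking for a way to extract the information between tags in a Python script I'm working on. I've been able to isolate the part I want with a regex tester, but the re.search method doesn't work in my code. So far I've been limited to using re.sub and split to get the information I want.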

I tried re.search and it returned an error, so I've been using re.sub instead:

 sub = re.sub('<.*?>',' ', line)
 sub = sub.split()

Sample string:

 <CellValue Index="0"><FormattedValue>System Managed Accounts 
 Group</FormattedValue><Value>System Managed Accounts Group</Value> 
 </CellValue>

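The code above pulls data from the right place but doesn't return all of it: it stops at the first space. How can I modify it to get the entire text between the tags?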

Answer 1

For this kind of task I generally prefer re.findall() over re.match().

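What you may not have realized is that you can use parentheses in a regex to define a "capture group". When the pattern contains one, re.findall() returns only the captured text and discards whatever matched outside the group. Some examples: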

import re

sample = '<CellValue Index="0"><FormattedValue>System Managed Accounts Group</FormattedValue><Value>System Managed Accounts Group</Value>  </CellValue>'

insideTags = re.findall(r'<(.*?)>', sample)
# ['CellValue Index="0"', 'FormattedValue', '/FormattedValue', 'Value', '/Value', '/CellValue']

openingTagsOnly = re.findall(r'<([^/]*?)>', sample)
# ['CellValue Index="0"', 'FormattedValue', 'Value']

betweenTags = re.findall(r'<.*?>([^<>]*?)</.*?>', sample)
# ['System Managed Accounts Group', 'System Managed Accounts Group']
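Applied to the code in the question, the same capture-group idea also works with re.search. The sketch below is illustrative only: the question does not show the actual error, so the suggestion that it came from using the match object directly (rather than calling .group()) is an assumption, and the shortened `line` value is a made-up stand-in for one line of the input.

```python
import re

# Hypothetical single line of input, shortened from the sample string.
line = '<FormattedValue>System Managed Accounts Group</FormattedValue>'

# The original approach: replacing tags with spaces and then calling
# split() with no arguments breaks the text on every run of whitespace.
words = re.sub(r'<.*?>', ' ', line).split()
# ['System', 'Managed', 'Accounts', 'Group']

# re.search returns a match object (or None when nothing matches),
# so the captured text has to be read with .group(1).
match = re.search(r'<FormattedValue>([^<>]*)</FormattedValue>', line)
text = match.group(1) if match else None
# 'System Managed Accounts Group'
```

Checking for None before calling .group() avoids an AttributeError on lines that contain no matching tag.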

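If you're parsing HTML/XML, you should really use a module like beautifulsoup; see "why regex cannot parse HTML/XML". But for the very simple example you've given, the last pattern (betweenTags) works by grabbing everything between the nearest pair of opening/closing tags, so that no other tags sit between them.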
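As a rough illustration of the parser route, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of beautifulsoup (a deliberate swap so that nothing has to be installed). It assumes the input is well-formed XML, which holds for the sample string but may not hold for the real data.

```python
import xml.etree.ElementTree as ET

sample = ('<CellValue Index="0"><FormattedValue>System Managed Accounts Group'
          '</FormattedValue><Value>System Managed Accounts Group</Value>  </CellValue>')

# Parse the fragment; the root element is <CellValue>.
cell = ET.fromstring(sample)

# findtext returns the text of the first matching child, or None if absent.
formatted = cell.findtext('FormattedValue')
value = cell.findtext('Value')

# Attributes are read with get(); values always come back as strings.
index = cell.get('Index')

print(formatted, value, index)
```

A parser keeps the whole element text intact, spaces included, and does not depend on how the tags happen to be laid out in the string.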
