Question

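I use an API to fetch some XML files, but some of them contain HTML tags that are not escaped, for example <br> or <b></b>.

I read them with the code below, but the files that contain HTML raise a parse error. I am not able to fix all the files by hand. Is there a way to parse them without losing the HTML tags?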

from xml.dom.minidom import parse, parseString

xml = ...#here is the api to receive the xml file
dom = parse(xml)
strings = dom.getElementsByTagName("string")

Answer 1

If you can use a third-party library, I recommend Beautiful Soup. It handles both XML and HTML, can parse broken markup, and has an easy-to-use API.
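A minimal sketch of that suggestion, assuming the `beautifulsoup4` package is installed and that the data looks like the hypothetical `sample` string below (the real layout of the API response is not shown in the question). It uses the built-in `html.parser` backend, which tolerates an unclosed `<br>` but lowercases tag names, so it may not suit XML with case-sensitive names.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real document comes from the API.
sample = '<resources><string name="greeting">Hello<br>world <b>bold</b></string></resources>'

# html.parser is lenient: an unclosed <br> does not raise an error.
soup = BeautifulSoup(sample, "html.parser")

# Map each <string> element's name attribute to its inner markup, HTML tags included.
contents = {tag.get("name"): tag.decode_contents() for tag in soup.find_all("string")}
print(contents)
```

`decode_contents()` returns the markup inside the element, so the `<b>` tags survive; `get_text()` would drop them.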

Answer 2

Read the XML file as a string and fix the malformed tags before parsing:

import xml.etree.ElementTree as ET

with open(xml) as xml_file:  # open the xml file for reading
    text = xml_file.read()  # read its contents
text = text.replace('<br>', '<br />')  # self-close <br> so the XML is well-formed
document = ET.fromstring(text)  # parse the repaired string
strings = document.findall('string')  # <string> elements directly under the root
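The same idea can be extended to other unclosed HTML tags and to getting the markup back out after parsing. The following is only a sketch using the standard library: `VOID_TAGS`, `close_void_tags`, `inner_markup`, and the `sample` string are hypothetical names and data chosen for illustration, and the regular expression assumes attribute values do not contain `>`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list of HTML void elements expected in the files; extend as needed.
VOID_TAGS = ("br", "hr", "img")

# Matches an opening void tag such as <br> or <img src="x">, but not <br/> or <br />.
_VOID_RE = re.compile(r"<(%s)\b([^<>]*?)(?<!/)>" % "|".join(VOID_TAGS), re.IGNORECASE)


def close_void_tags(text):
    """Rewrite unclosed void tags as self-closing so the text is well-formed XML."""
    return _VOID_RE.sub(r"<\1\2 />", text)


def inner_markup(element):
    """Return the element's text plus its child elements serialized back to markup."""
    parts = [element.text or ""]
    for child in element:
        # tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Hypothetical sample; the real document comes from the API.
sample = '<resources><string name="greeting">Hello<br>world <b>bold</b></string></resources>'
root = ET.fromstring(close_void_tags(sample))
values = {s.get("name"): inner_markup(s) for s in root.iter("string")}
print(values)
```

Here `root.iter("string")` visits `<string>` elements at any depth, which is closer to `getElementsByTagName` in the question than `findall('string')`, which only looks at direct children.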

Parsing XML files that contain HTML content in Python
