I'm trying to parse some XML, but it contains escaped characters. Is there an easier way to do this?
The XML:
<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
<title>SRG-OS-000257-GPOS-00098</title>
<description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt; </description>
<Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
<version>RHEL-07-010010</version>
<title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
<description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278 GPOS-00108&lt;/VulnDiscussion&gt;
</description>
</Rule>
</Group>
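I'm trying to extract the Group id, the Rule severity, the title, and the VulnDiscussion contained in the description tag. I can get everything except the VulnDiscussion, because it contains the escaped characters &gt; and &lt;.
Here is my code: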
import xml.etree.ElementTree as ET
import HTMLParser
tree = ET.parse("test.xml")
root = tree.getroot()
for findings in root.iter('Group'):
    print findings.get('id')
    rule = findings.find('Rule')
    print rule.get('severity')
    print rule.find('title').text
    description = rule.find('description')
    # my attempt at unescaping the description tag to parse the VulnDiscussion
    embeddedHtml = HTMLParser.HTMLParser()
    unescapedXML = embeddedHtml.unescape(description)
    newtree = ET.fromstring(unescapedXML)
    print newtree.get(VulnDiscussion).text
It crashes with:
    newtree = ET.fromstring(unescapedXML)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1640, in feed
    self._parser.Parse(data, 0)
TypeError: must be string or read-only buffer, not Element
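Answer 0 (score: 1)
I'd suggest using lxml instead of the standard library's xml package; it is more powerful and more convenient. Its parser unescapes entities such as &lt; and &gt; in text automatically (the standard library's ElementTree does this as well), so the text of each description comes back as a plain string of markup that can be parsed a second time. Using XPath will also make your life easier.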
from lxml import etree as ET
xml = ET.XML(b"""<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
<title>SRG-OS-000257-GPOS-00098</title>
<description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt; </description>
<Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
<version>RHEL-07-010010</version>
<title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
<description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278 GPOS-00108&lt;/VulnDiscussion&gt;
</description>
</Rule>
</Group>""")
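# The parser has already unescaped each description's text, so it can be parsed again.
for description in xml.xpath('//description/text()'):
    # xpath() returns a list; take the first VulnDiscussion text, or None if there is none
    vulnDiscussion = next(iter(ET.XML(description).xpath('/VulnDiscussion/text()')), None)
    print(vulnDiscussion)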
The code above produces:
None
Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278 GPOS-00108
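If installing lxml is not an option, the same idea should also work with the standard library alone. The TypeError in the question comes from passing the description Element itself to unescape(); its .text attribute is a string that the parser has already unescaped, so no HTMLParser step should be needed. The following is only a minimal sketch under some assumptions: Python 3, a sample with the VulnDiscussion text trimmed for brevity, and a hypothetical helper named extract_findings that does not appear in the question. Wrapping the embedded markup in a dummy root element is a precaution of mine in case a description holds several sibling elements.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the question; the VulnDiscussion text is trimmed for brevity.
SAMPLE = """<Group id="RHEL-07-010010">
<title>SRG-OS-000257-GPOS-00098</title>
<description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt; </description>
<Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
<version>RHEL-07-010010</version>
<title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
<description>&lt;VulnDiscussion&gt;Discretionary access control is weakened.&lt;/VulnDiscussion&gt;
</description>
</Rule>
</Group>"""


def extract_findings(xml_text):
    """Hypothetical helper: return (group id, severity, title, VulnDiscussion) tuples."""
    root = ET.fromstring(xml_text)
    results = []
    # iter() also yields the root element itself when its tag matches.
    for group in root.iter('Group'):
        rule = group.find('Rule')
        # .text was unescaped by the parser, so it is a plain string of markup.
        embedded = rule.find('description').text or ''
        # Assumption: a dummy root lets several sibling elements parse as one document.
        inner = ET.fromstring('<root>' + embedded + '</root>')
        discussion = inner.findtext('VulnDiscussion')
        results.append((group.get('id'), rule.get('severity'),
                        rule.find('title').text, discussion))
    return results


for finding in extract_findings(SAMPLE):
    print(finding)
```

The key difference from the code in the question is that the string in description.text is what gets parsed a second time, not the Element object.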