Python: parsing child tags in XML

Time: 2017-01-12 16:20:25

Tags: python xml-parsing lxml

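I am trying to use lxml to get the contents of a child tag. The XML file I am parsing is valid, but for some reason, when I try to parse the child elements, lxml seems to think the XML is invalid. From other posts I gather that this error is usually raised when a closing tag is missing, yet the XML parses fine in a browser. Any idea why this is happening?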

Contents of the XML file (test.xml):

<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
    <title>SRG-OS-000257-GPOS-00098</title>
    <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt;   </description>
    <Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
      <version>RHEL-07-010010</version>
      <title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
      <description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-  GPOS-00108&lt;/VulnDiscussion&gt;&lt;FalsePositives&gt;&lt; /FalsePositives&gt;&lt;FalseNegatives&gt;&lt; /FalseNegatives&gt;&lt;Documentable&gt;false&lt; /Documentable&gt;&lt;Mitigations&gt;&lt; /Mitigations&gt;&lt;SecurityOverrideGuidance&gt;&lt; /SecurityOverrideGuidance&gt;&lt;PotentialImpacts&gt;&lt; /PotentialImpacts&gt;&lt;ThirdPartyTools&gt;&lt; /ThirdPartyTools&gt;&lt;MitigationControl&gt;&lt; /MitigationControl&gt;&lt;Responsibility&gt;&lt; /Responsibility&gt;&lt;IAControls&gt;&lt;/IAControls&gt;</description>
      <ident system="http://iase.disa.mil/cci">CCI-001494</ident>
      <ident system="http://iase.disa.mil/cci">CCI-001496</ident>
      <fixtext fixref="F-RHEL-07-010010_fix">Run the following command to  determine which package owns the file:

# rpm -qf &lt;filename&gt;

Reset the permissions of files within a package with the following command:

#rpm --setperms &lt;packagename&gt;

Reset the user and group ownership of files within a package with the following command:

#rpm --setugids &lt;packagename&gt;</fixtext>
      <fix id="F-RHEL-07-010010_fix" />
      <check system="C-RHEL-07-010010_chk">
        <check-content-ref name="M" href="VMS_XCCDF_Benchmark_SRG.xml" />
            <check-content>Verify the file permissions, ownership, and group  membership of system files and commands match the vendor values.
Check the file permissions, ownership, and group membership of system files and commands with the following command:

# rpm -Va | grep '^.M'

If there is any output from the command, this is a finding.</check-content>
      </check>
    </Rule>
  </Group>

I am trying to get the contents of the VulnDiscussion tag. I can get the contents of its parent tag, description, as follows:

from lxml import etree as ET

xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    print(description)

This produces the following output:

<GroupDescription></GroupDescription>
<VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-GPOS-00108</VulnDiscussion>   <FalsePositives></FalsePositives><FalseNegatives> </FalseNegatives><Documentable>false</Documentable><Mitigations></Mitigations> <SecurityOverrideGuidance></SecurityOverrideGuidance><PotentialImpacts> </PotentialImpacts><ThirdPartyTools></ThirdPartyTools><MitigationControl> </MitigationControl><Responsibility></Responsibility><IAControls></IAControls>

So far so good. Now I try to extract the contents of VulnDiscussion with the following code:

for description in xml.xpath('//description/text()'):
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
    print(vulnDiscussion)

and get the following error:

 vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
  File "src/lxml/lxml.etree.pyx", line 3192, in lxml.etree.XML (src/lxml/lxml.etree.c:78763)
  File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
  File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117021)
  File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111265)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "<string>", line 3
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 3,  column 79

1 answer:

Answer 0 (score: 1)

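An XML document can have only one root element, but a string returned by xml.xpath('//description/text()') can contain several sibling elements at the top level (VulnDiscussion, FalsePositives, and so on), so ET.XML() fails with "Extra content at the end of the document". Wrap the string in a single element and the document will have exactly one root.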
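The single-root rule can be checked in isolation. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so that it runs without extra packages), and the fragment string and the parses() helper are hypothetical, not part of the original post.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical fragment with two sibling elements and no common root
fragment = "<VulnDiscussion>some text</VulnDiscussion><FalsePositives></FalsePositives>"


def parses(text):
    """Return True if text is a well-formed XML document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two top-level elements: rejected by the parser
bare_ok = parses(fragment)

# The same content wrapped in one root element: accepted
wrapped_ok = parses("<Testroot>" + fragment + "</Testroot>")

print(bare_ok, wrapped_ok)
```

With lxml the failing case raises lxml.etree.XMLSyntaxError, the exception shown in the question's traceback, rather than ParseError.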

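Also note that many closing tags in the escaped text of the original XML are written with a space after the '<' (for example < /FalsePositives>). That space has to be removed before parsing.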
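To see why the space matters, here is a small sketch under the same assumption as before (standard-library xml.etree.ElementTree standing in for lxml). The raw string is a shortened, hypothetical excerpt modelled on the Documentable tag from the question.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# A closing tag written as "< /Documentable>" is not well-formed XML
raw = "<Testroot><Documentable>false< /Documentable></Testroot>"

try:
    ET.fromstring(raw)
    raw_error = None
except ET.ParseError as exc:
    raw_error = exc  # the parser rejects the malformed closing tag

# Removing the space turns "< /" into a proper "</"
cleaned = raw.replace("< /", "</")
documentable = ET.fromstring(cleaned).findtext("Documentable")

print(raw_error is not None, documentable)
```

A plain str.replace is enough here because the malformed tags all use exactly one space. It would also rewrite any literal "< /" in ordinary text, which does not occur in this file.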

from lxml import etree as ET

xml = ET.parse("test.xml")

for description in xml.xpath('//description/text()'):
    # Add a root tag and remove the space after '<' in the closing tags
    x = ET.XML('<Testroot>' + description.replace('< /', '</') + '</Testroot>')
    vulnDiscussion = next(iter(x.xpath('//VulnDiscussion/text()')), None)
    if vulnDiscussion:
        print(vulnDiscussion)

Output:

    Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.

    Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-  GPOS-00108
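If the same extraction is needed in more than one place, both fixes can be folded into one helper. What follows is a minimal sketch, not the answer's own code: it again assumes the standard library's xml.etree.ElementTree in place of lxml, the function name extract_vuln_discussions is hypothetical, and SAMPLE is a shortened, made-up stand-in for test.xml.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Shortened, made-up stand-in for test.xml (not the real file contents)
SAMPLE = """<Group id="RHEL-07-010010">
  <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt;</description>
  <Rule id="RHEL-07-010010_rule">
    <description>&lt;VulnDiscussion&gt;Access control is weakened.&lt;/VulnDiscussion&gt;&lt;Documentable&gt;false&lt; /Documentable&gt;</description>
  </Rule>
</Group>"""


def extract_vuln_discussions(root):
    """Collect the text of every VulnDiscussion embedded in a description element."""
    results = []
    for description in root.iter("description"):
        text = description.text or ""
        # Normalise closing tags written as "< /Tag>" to "</Tag>"
        text = re.sub(r"<\s+/", "</", text)
        # Wrap the fragment so that it has exactly one root element
        wrapped = ET.fromstring("<Testroot>" + text + "</Testroot>")
        discussion = wrapped.findtext("VulnDiscussion")
        if discussion:
            results.append(discussion)
    return results


discussions = extract_vuln_discussions(ET.fromstring(SAMPLE))
print(discussions)
```

The regular expression accepts any amount of whitespace after '<', which is slightly more tolerant than the literal replace in the answer. The helper still raises a parse error if the unescaped text contains a bare '&' or '<' that is not markup, so real data may need extra handling.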