Question

I can't parse an XML file (it is a GC history log). A sample of the XML is shown below.

<?xml version="1.0" ?>

<verbosegc xmlns="http://www.ibm.com/j9/verbosegc" version="R28_jvm.28_20150612_0201_B252774_CMPRSS">

<initialized id="1" timestamp="2015-12-04T20:17:07.219">
  <attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
  <attribute name="maxHeapSize" value="0x20000000" />
  <attribute name="initialHeapSize" value="0x400000" />
</initialized>

<cycle-start id="4" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.677" intervalms="3457.977" />
<gc-start id="5" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.677">
  <mem-info id="6" free="3037768" total="4194304" percent="72">
  </mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4" durationms="0.807" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.678" activeThreads="2">
  <mem-info id="9" free="3163968" total="4194304" percent="75">
  </mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.678" />
<cycle-start id="16" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.742" intervalms="64.838" />
<gc-start id="17" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.742">
  <mem-info id="18" free="3037664" total="4194304" percent="72">
  </mem-info>
</gc-start>
 <gc-end id="20" type="scavenge" contextid="16" durationms="0.649" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.743" activeThreads="2">
  <mem-info id="21" free="3110592" total="4194304" percent="74">
  </mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.743" />
<allocation-satisfied id="23" threadId="0000000002E10500" bytesRequested="416" />


</verbosegc>

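I want to use the free attribute of mem-info inside gc-start and gc-end. Both elements sit between the cycle-start and cycle-end tags and share the same contextid. For example, the first two mem-info values are 3037768 and 3163968, and their contextid is 4, which equals the cycle-start id. With this data I can plot a graph of memory usage.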

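The main problem for me is that I cannot parse the XML successfully with Python's XML parsing methods. getroot() works, but every other find / findall call returns nothing. Is there another solution? Thanks.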

Here is what I tried:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('gc.trace')
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x7fdfaddc19d0>
>>> root=tree.getroot()
>>> root
<Element '{http://www.ibm.com/j9/verbosegc}verbosegc' at 0x7fdfaddc1a90>
>>> cycle_start = root.findall('cycle-start')
>>> cycle_start
[]                # Empty???
>>> cycle_start = root.findall('mem-info')
>>> print cycle_start
[]                # Empty???
>>> 
>>> cycle_start = root.find('mem-info')
>>> cycle_start
>>> print cycle_start
None

>>> from lxml import etree
>>> tree = etree.parse("gc.log")
>>> root = tree.getroot()
>>> root.findall('mem-info', root.nsmap)

>>> root.nsmap
{None: 'http://www.ibm.com/j9/verbosegc'}

Answer 1

That's because your XML has a default namespace here:

xmlns="http://www.ibm.com/j9/verbosegc"

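Note that descendant elements implicitly inherit the default namespace of their ancestors. You can use a prefix-to-namespace mapping to reach elements in that namespace, for example: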

ns = {'d': 'http://www.ibm.com/j9/verbosegc'}
cycle_starts = root.findall('d:cycle-start', namespaces=ns)
print(cycle_starts)

mem_infos = root.findall('d:gc-start/d:mem-info', namespaces=ns)
print(mem_infos)

Output:

[<Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae6a0>, <Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae8d0>]
[<Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae780>, <Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae9b0>]
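Building on the prefix mapping above, here is a minimal sketch of how the free values could be paired by contextid, which is what the question ultimately wants for plotting. It uses only the standard library. The helper name free_by_context and the inlined, trimmed SAMPLE string are assumptions made for illustration; in practice the root would come from ET.parse on the real log file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, inlined so the sketch runs on its own.
SAMPLE = """<?xml version="1.0" ?>
<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<cycle-start id="4" type="scavenge" contextid="0" />
<gc-start id="5" type="scavenge" contextid="4">
  <mem-info id="6" free="3037768" total="4194304" percent="72" />
</gc-start>
<gc-end id="8" type="scavenge" contextid="4">
  <mem-info id="9" free="3163968" total="4194304" percent="75" />
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" />
<cycle-start id="16" type="scavenge" contextid="0" />
<gc-start id="17" type="scavenge" contextid="16">
  <mem-info id="18" free="3037664" total="4194304" percent="72" />
</gc-start>
<gc-end id="20" type="scavenge" contextid="16">
  <mem-info id="21" free="3110592" total="4194304" percent="74" />
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" />
</verbosegc>
"""

NS = {'d': 'http://www.ibm.com/j9/verbosegc'}


def free_by_context(root):
    """Return {contextid: (free_at_gc_start, free_at_gc_end)}.

    Hypothetical helper, not part of any library.
    """
    def collect(tag):
        # Map each element's contextid to the free value of its mem-info child.
        values = {}
        for gc in root.findall('d:' + tag, NS):
            mem = gc.find('d:mem-info', NS)
            if mem is not None:
                values[gc.get('contextid')] = int(mem.get('free'))
        return values

    before = collect('gc-start')
    after = collect('gc-end')
    # Keep only the cycles that have both a gc-start and a gc-end record.
    return {cid: (before[cid], after[cid]) for cid in before if cid in after}


root = ET.fromstring(SAMPLE)
pairs = free_by_context(root)
for cid, (start_free, end_free) in pairs.items():
    print(cid, start_free, end_free)
```

For the sample this prints "4 3037768 3163968" and "16 3037664 3110592". Keying on contextid rather than on element order is a deliberate choice here, since it should keep the pairs correct even if other records are interleaved in the log.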

Update:

In response to your comment, here is one possible way to avoid hard-coding the namespace:

# Map the default namespace URI to the prefix d without hard-coding it.
# Note: root.nsmap exists on lxml elements only, not on xml.etree.ElementTree ones.
ns = {'d': root.nsmap[None]}
result = root.findall('.//d:mem-info', namespaces=ns)
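Since nsmap is only available in lxml, a rough standard-library equivalent might look like the sketch below. It reads the namespace URI out of the root tag, which ElementTree stores as '{uri}localname'. The helper name namespace_of and the shortened SAMPLE string are assumptions for illustration. The last statement relies on the '{*}' namespace wildcard, which ElementTree supports from Python 3.8 onward.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's sample, inlined so the sketch is self-contained.
SAMPLE = """<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<gc-start id="5" contextid="4">
  <mem-info id="6" free="3037768" />
</gc-start>
<gc-end id="8" contextid="4">
  <mem-info id="9" free="3163968" />
</gc-end>
</verbosegc>
"""


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or '' if there is none.

    Hypothetical helper, not part of any library.
    """
    tag = element.tag
    return tag[1:].split('}', 1)[0] if tag.startswith('{') else ''


root = ET.fromstring(SAMPLE)

# Derive the prefix mapping from the document instead of hard-coding the URI.
ns = {'d': namespace_of(root)}
free_values = [m.get('free') for m in root.findall('.//d:mem-info', ns)]
print(free_values)

# Python 3.8+ only: match mem-info in any namespace, with no mapping at all.
wildcard_values = [m.get('free') for m in root.findall('.//{*}mem-info')]
print(wildcard_values)
```

Both lookups should return the same elements in document order. The wildcard form is shorter, but it would also match a mem-info element from an unrelated namespace, so the explicit mapping is the stricter of the two.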

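Also, I'd suggest using lxml's xpath() method instead of findall(), because the former offers better support for standard XPath 1.0 expressions, which will be useful in more complex cases.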
