Question

I have an XML file in the following format:

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>

<type...>
</type>
</doc>

I want to parse this document and build a hash table like this:

{X: {"A": [(100,80), (200,90)], "B": [(100,20), (20,90)]}, Y: .....}

How would I do this in Python?

Answer 1

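I disagree with the suggestion in other answers to use minidom. It is a Python adaptation of a standard that was designed mostly for other languages: usable, but not a great fit. The recommended approach in modern Python is ElementTree.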

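The same interface is also implemented, with better speed, in the third-party module lxml, but unless you need blazing speed, the version that ships in the Python standard library is fine (and still faster than minidom). The key point is to program to that interface; you can then switch to a different implementation of the same interface whenever you like, with minimal changes to your own code.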
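To make the "program to the interface" point concrete, here is a minimal sketch, assuming you want lxml when it happens to be installed and the standard library otherwise. The helper name `root_tag` is made up for illustration.

```python
# Prefer lxml if it is installed; otherwise fall back to the standard library.
# Both expose the ElementTree API, so the code below does not care which one it got.
try:
    from lxml import etree as et
except ImportError:
    from xml.etree import ElementTree as et


def root_tag(xml_text):
    """Return the tag name of the root element (hypothetical helper)."""
    return et.fromstring(xml_text).tag


print(root_tag('<doc><id name="X"/></doc>'))  # prints: doc
```

Only the import changes between the two implementations; everything written against `et` stays the same.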

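For example, after the needed imports and so on, the following code is a minimal implementation of your example (it does not verify that the XML is correct, it just extracts the data assuming correctness; adding various kinds of checks is pretty easy, of course):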

from xml.etree import ElementTree as et  # or, import any other, faster version of ET

def xml2data(xmlfile):
  tree = et.parse(xmlfile)
  data = {}
  # Iterate over elements directly; getchildren() was removed in Python 3.9
  for anid in tree.getroot():
    currdict = data[anid.get('name')] = {}
    for atype in anid:
      currlist = currdict[atype.get('name')] = []
      for c in atype:
        # Attribute values are strings, e.g. ('100', '80')
        currlist.append((c.get('val'), c.get('id')))
  return data

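Given your sample input, this produces the structure you want. Note that the attribute values come back as strings ('100', not 100); wrap them in int() if you need numbers.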
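The question shows integers in the desired hash table, so here is a hedged variant of the same idea that converts the values. It is only a sketch: it assumes every `val` and `id` attribute is present and numeric, it parses from a string rather than a file, and the names `SAMPLE` and `xml2data_int` are invented for this example.

```python
from xml.etree import ElementTree as et

SAMPLE = """\
<doc>
  <id name="X">
    <type name="A">
      <min val="100" id="80"/>
      <max val="200" id="90"/>
    </type>
    <type name="B">
      <min val="100" id="20"/>
      <max val="20" id="90"/>
    </type>
  </id>
</doc>
"""


def xml2data_int(xml_text):
    """Build {id name: {type name: [(val, id), ...]}} with integer tuples."""
    data = {}
    root = et.fromstring(xml_text)
    for anid in root.findall('id'):
        types = data[anid.get('name')] = {}
        for atype in anid.findall('type'):
            # int() raises TypeError if an attribute is missing
            # and ValueError if it is not numeric
            types[atype.get('name')] = [
                (int(c.get('val')), int(c.get('id'))) for c in atype
            ]
    return data


result = xml2data_int(SAMPLE)
print(result)
# {'X': {'A': [(100, 80), (200, 90)], 'B': [(100, 20), (20, 90)]}}
```

Using `findall('id')` and `findall('type')` also skips any stray elements with other tag names at those levels, which the plain iteration would pick up.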

Answer 2

Don't reinvent the wheel. Use the Amara toolkit. Variable names are just keys in a dictionary anyway. http://www.xml3k.org/Amara

Answer 3

I'd suggest using the minidom library.

The documentation is pretty good, so you should be up and running in no time.

Dan

Answer 4

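As others have said, minidom is the way to go. You open (and parse) the file, and as you walk through the nodes you check whether each one is relevant and should be read. That way you also know whether to read its child nodes.

I threw this together, and it seems to do what you want. Some values are read by attribute position rather than by attribute name. And there is no error handling. The print() at the end means it's Python 3.x.

I'll leave it as an exercise to improve on this; I just wanted to post a snippet to get you started.

Happy hacking! :)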

xml.txt

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>

parsexml.py

from xml.dom import minidom
data={}
doc=minidom.parse("xml.txt")
for n in doc.childNodes[0].childNodes:
    if n.localName=="id":
        # "name" is the only attribute here, so reading it by position is safe
        id_name = n.attributes.item(0).nodeValue
        data[id_name] = {}
        for j in n.childNodes:
            if j.localName=="type":
                type_name = j.attributes.item(0).nodeValue
                data[id_name][type_name] = [(),()]
                for k in j.childNodes:
                    # Read val and id by name: attribute order is not guaranteed
                    if k.localName=="min":
                        data[id_name][type_name][0] = \
                            (k.getAttribute("val"), k.getAttribute("id"))
                    if k.localName=="max":
                        data[id_name][type_name][1] = \
                            (k.getAttribute("val"), k.getAttribute("id"))
print (data)

Output:

{'X': {'A': [('100', '80'), ('200', '90')], 'B': [('100', '20'), ('20', '90')]}}
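The snippet above leaves error handling as an exercise. One possible way to take that up, sketched here under a few assumptions: the XML arrives as a string, malformed input should yield `None` rather than an exception, and children missing `val` or `id` are simply skipped. The function name `parse_ranges` is hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_ranges(xml_text):
    """Return {id name: {type name: [(val, id), ...]}}, or None if the XML is malformed."""
    try:
        doc = minidom.parseString(xml_text)
    except ExpatError:
        # minidom parses with expat, which raises ExpatError on malformed input
        return None
    data = {}
    for id_node in doc.getElementsByTagName("id"):
        types = data.setdefault(id_node.getAttribute("name"), {})
        for type_node in id_node.getElementsByTagName("type"):
            pairs = types.setdefault(type_node.getAttribute("name"), [])
            for child in type_node.childNodes:
                # Skip whitespace text nodes between elements
                if child.nodeType != child.ELEMENT_NODE:
                    continue
                # Read attributes by name and ignore incomplete entries
                if child.hasAttribute("val") and child.hasAttribute("id"):
                    pairs.append((child.getAttribute("val"), child.getAttribute("id")))
    return data


print(parse_ranges('<doc><id name="X"><type name="A"><min val="100" id="80"/></type></id></doc>'))
# {'X': {'A': [('100', '80')]}}
```

Unlike the original, this collects every child that has both attributes instead of assuming exactly one `min` and one `max`, which may or may not be what you want.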

Answer 5

Another XML parsing library: http://www.crummy.com/software/BeautifulSoup/

The documentation on parsing XML starts here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20XML

Answer 6

Why not try something like the PyXml library? It has plenty of documentation and tutorials.

Parsing XML into a hash table
