Question

I have an XML structure similar to the following, but much larger:

<root>
    <conference name='1'>
        <author>
            Bob
        </author>
        <author>
            Nigel
        </author>
    </conference>
    <conference name='2'>
        <author>
            Alice
        </author>
        <author>
            Mary
        </author>
    </conference>
</root>

To read it, I used the following code:

from xml.dom.minidom import parse

dom = parse(filepath)
conference = dom.getElementsByTagName('conference')
for node in conference:
    conf_name = node.getAttribute('name')
    print(conf_name)
    alist = node.getElementsByTagName('author')
    for a in alist:
        authortext = a.nodeValue
        print(authortext)

However, the authortext that gets printed is "None". I tried a variant like the one below, but it crashes my program.

authortext=a[0].nodeValue

The correct output should be:

1
Bob
Nigel
2
Alice
Mary

But what I get is:

1
None
None
2
None
None

Any suggestions on how to fix this?

Answer 1

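The node you read authortext from is of type 1 (ELEMENT_NODE), and the nodeValue of an element is None. To get the string you normally need the TEXT_NODE inside it. This will work: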

a.childNodes[0].nodeValue
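
To see why this works, here is a minimal sketch in Python 3 that inspects the node types involved. The one-element sample document is an assumption made for illustration, not the asker's file:

```python
from xml.dom.minidom import parseString

# Hypothetical one-author document, used only for illustration.
dom = parseString("<author>Bob</author>")
author = dom.documentElement

# The element itself carries no value; its text lives in a child node.
element_type = author.nodeType      # Node.ELEMENT_NODE
element_value = author.nodeValue    # always None for an element

text_node = author.childNodes[0]
text_type = text_node.nodeType      # Node.TEXT_NODE
text_value = text_node.nodeValue    # the actual string

print(element_type, element_value)
print(text_type, text_value)
```

The element and its text are two separate nodes in the tree, which is why nodeValue has to be read one level down.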

Answer 2

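Element nodes don't have a nodeValue. You have to look at the Text nodes inside them. If you know there is always exactly one text node inside, you can use element.firstChild.data (for a text node, data is the same as nodeValue).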

Be careful: if there is no text content, there is no child Text node, and element.firstChild will be None, so the .data access fails with an AttributeError.
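
One way to guard against that case is sketched below. The sample document, with one empty author element, is an assumption for illustration, and falling back to an empty string is just one possible choice:

```python
from xml.dom.minidom import parseString

# Hypothetical document: the second <author/> has no text at all.
dom = parseString("<authors><author>Bob</author><author/></authors>")

names = []
for author in dom.getElementsByTagName("author"):
    first = author.firstChild
    # firstChild is None for an empty element, and it may also be a
    # non-text node (for example a comment), so check both.
    if first is not None and first.nodeType == first.TEXT_NODE:
        names.append(first.data)
    else:
        names.append("")

print(names)
```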

A quick way to get the content of the direct child text nodes:

text = ''.join(child.data for child in element.childNodes if child.nodeType == child.TEXT_NODE)
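
Applied to the XML from the question, that expression could be used roughly as follows. This is a sketch in Python 3; the helper name direct_text is made up for this example, and the strip() call is an addition to drop the indentation and newlines that surround each name in the sample:

```python
from xml.dom.minidom import parseString

XML = """<root>
    <conference name='1'>
        <author>
            Bob
        </author>
        <author>
            Nigel
        </author>
    </conference>
    <conference name='2'>
        <author>
            Alice
        </author>
        <author>
            Mary
        </author>
    </conference>
</root>"""


def direct_text(element):
    """Join the data of the element's direct child text nodes (hypothetical helper)."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


dom = parseString(XML)
lines = []
for conference in dom.getElementsByTagName("conference"):
    lines.append(conference.getAttribute("name"))
    for author in conference.getElementsByTagName("author"):
        # strip() removes the whitespace around the name in the sample XML
        lines.append(direct_text(author).strip())

print("\n".join(lines))
```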

In DOM Level 3 Core you can use the textContent property to get the text from inside an Element recursively, but minidom doesn't support this (some other Python DOM implementations do).
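
If you need something like textContent with minidom, a small recursive function can approximate it. The following is only a sketch: text_content is a hypothetical helper, not part of minidom, and the nested sample markup is made up for illustration:

```python
from xml.dom.minidom import parseString


def text_content(node):
    """Concatenate the data of all descendant text nodes in document order.

    Hypothetical helper that approximates DOM Level 3 textContent.
    """
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    # Comments and processing instructions have no children in minidom,
    # so they contribute an empty string here.
    return "".join(text_content(child) for child in node.childNodes)


dom = parseString("<author>Bob <b>the <i>Builder</i></b></author>")
full_text = text_content(dom.documentElement)
print(full_text)
```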

Answer 3

Quick access:

node.getElementsByTagName('author')[0].childNodes[0].nodeValue
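
Note that the [0] index picks only the first author of the conference. The sketch below contrasts that with collecting every author; the compact sample document is an assumption for illustration:

```python
from xml.dom.minidom import parseString

# Hypothetical compact version of one conference from the question.
dom = parseString(
    "<root><conference name='1'>"
    "<author>Bob</author><author>Nigel</author>"
    "</conference></root>"
)
node = dom.getElementsByTagName("conference")[0]

# The one-liner above: first author only.
first_only = node.getElementsByTagName("author")[0].childNodes[0].nodeValue

# Same access pattern applied to every author element.
all_names = [
    a.childNodes[0].nodeValue for a in node.getElementsByTagName("author")
]

print(first_only)
print(all_names)
```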

Answer 4

Since each author always has exactly one text value, you can use element.firstChild.data:

from xml.dom.minidom import parseString

# `document` is assumed to hold the XML from the question as a string
dom = parseString(document)
conferences = dom.getElementsByTagName("conference")

# Each conference here is an element node
for conference in conferences:
    conference_name = conference.getAttribute("name")
    print()
    print(conference_name.upper() + " - ")

    authors = conference.getElementsByTagName("author")
    for author in authors:
        # firstChild is the text node that holds the author's name
        print("  ", author.firstChild.data)

    print()

Answer 5

I played around with it a bit, and this is what worked for me:

# ...
authortext = a.childNodes[0].nodeValue
print(authortext)

This produces the output:

C:\temp\py>xml2.py
1
Bob
Nigel
2
Alice
Mary

I can't tell you exactly why you have to go through the child node to get the inner text, but at least this gives you what you were looking for.

Reading XML with Python minidom and iterating over each node
