Question

I have an XML file:

<uniprot created="2010-12-20">
 <entry dataset="abc">
    <references id="1">
        <title>first references</title>
        <author>
            <person name="Mr. A"/>
            <person name="Mr. B"/>
            <person name="Mr. C"/>
        </author>
        <scope> scope 1 for id 1 </scope>
        <scope> scope 2 for id 1 </scope>
        <scope> scope 2 for id 1 </scope>
    </references>
    <references id="2">
        <title>Second references</title>
        <author>
            <person name="Mr. D"/>
            <person name="Mr. E"/>
            <person name="Mr. F"/>
        </author>
        <scope> scope 1 for id 2 </scope>
        <scope> scope 2 for id 2 </scope>
        <scope> scope 3 for id 2 </scope>
    </references>
    <references id="3">
        <title>third references</title>
        <author>
            <person name="Mr. G"/>
            <person name="Mr. H"/>
            <person name="Mr. I"/>
        </author>
        <scope> scope 1 for id 3 </scope>
        <scope> scope 2 for id 3 </scope>
        <scope> scope 3 for id 3 </scope>
    </references>
    <references id="4">
        <title>fourth references</title>
        <author>
            <person name="Mr. J"/>
            <person name="Mr. K"/>
            <person name="Mr. L"/>
        </author>
        <scope> scope 1 for id 4 </scope>
        <scope> scope 2 for id 4 </scope>
        <scope> scope 3 for id 4 </scope>
    </references>
  </entry>
</uniprot>

I want to display all the references in this XML in a specific format. Desired output:

First Reference
Mr A, Mr B, Mr C
Scope 1 for id 1, Scope 2 for id 1, Scope 3 for id 1

Second Reference
Mr D, Mr E, Mr F
Scope 1 for id 2, Scope 2 for id 2, Scope 3 for id 2

Third Reference
Mr G, Mr H, Mr I
Scope 1 for id 3, Scope 2 for id 3, Scope 3 for id 3

Fourth Reference
Mr J, Mr K, Mr L
Scope 1 for id 4, Scope 2 for id 4, Scope 3 for id 4

I have written my code and can get the title values in the right format, but I cannot get the author information for each individual entry.

import xml.etree.ElementTree as ET
document = ET.parse("recipe.xml")
root = document.getroot()
title=[]
author=[]
scope=[]  

for i in root.getiterator('title'):
    title.append(i.text)
    for j in root.getiterator('author'):
        author.append(j.text)
        for k in root.getiterator('scope'):
            scope.append(k.text)

for i, j, k in zip(title, author, scope):
    print i, j, k

Answer 1

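Because the authors' names are stored in the name attribute of the person tag, read them with person.get('name') rather than .text. Let's also use a dict to store each reference's data, like this: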

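references = []
# Walk each <references> element so authors and scopes stay grouped per entry.
for ref in root.iter('references'):
    reference = {
        'title': ref.findtext('title'),
        'authors': [],
        'scopes': [],
    }

    # Search inside this entry only, not the whole document.
    for person in ref.iter('person'):
        reference['authors'].append(person.get('name'))

    for scope in ref.iter('scope'):
        reference['scopes'].append(scope.text.strip())

    # Keep the finished dict; without this the list stays empty.
    references.append(reference)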

In the end, you will get a list of dicts like this:

[
    {
        'title': 'Something',
        'authors': [
            'Author 1',
            'Author 2',
        ],
        'scopes': [
            'scope 1',
            'scope 2',
        ]
    }
]
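
To get from a list like that to the layout asked for in the question, something along these lines might work. It is only a sketch: format_references is a hypothetical helper name, the sample list is hand-written from the first two entries in the question (shortened to two scopes each), and it assumes every dict has the 'title', 'authors' and 'scopes' keys shown above.

```python
def format_references(references):
    """Return one text block per reference: title, authors, scopes."""
    blocks = []
    for reference in references:
        lines = [
            reference['title'],
            ', '.join(reference['authors']),
            # Strip the padding that the XML text nodes carry around each scope.
            ', '.join(scope.strip() for scope in reference['scopes']),
        ]
        blocks.append('\n'.join(lines))
    # Separate references with a blank line.
    return '\n\n'.join(blocks)


# Hand-written sample data (assumption), based on the question's first two entries.
sample = [
    {
        'title': 'first references',
        'authors': ['Mr. A', 'Mr. B', 'Mr. C'],
        'scopes': [' scope 1 for id 1 ', ' scope 2 for id 1 '],
    },
    {
        'title': 'Second references',
        'authors': ['Mr. D', 'Mr. E', 'Mr. F'],
        'scopes': [' scope 1 for id 2 ', ' scope 2 for id 2 '],
    },
]

print(format_references(sample))
```

Returning a string instead of printing inside the helper keeps it usable from both Python 2 and Python 3 and makes the result easy to check.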

Answer 2

Using lxml and XPath:

from lxml.etree import fromstring

# x holds the XML document as a string
x = fromstring(x)

def print_references(ref_node):
    # XPath expressions are evaluated relative to the current <references> node
    authors = " ".join(ref_node.xpath('author/person/@name'))
    scope = ", ".join([t.text for t in ref_node.xpath('scope')])
    # The id attribute, or None if the node has no id
    ref = next(iter(ref_node.xpath('@id')), None)
    print("{} Reference\n{}\n{}\n".format(ref, authors, scope.lstrip()))

references = x.xpath('//references')
for ref in references:
    print_references(ref)

Output:

1 Reference
Mr. A Mr. B Mr. C
scope 1 for id 1 ,  scope 2 for id 1 ,  scope 2 for id 1

2 Reference
Mr. D Mr. E Mr. F
scope 1 for id 2 ,  scope 2 for id 2 ,  scope 3 for id 2

3 Reference
Mr. G Mr. H Mr. I
scope 1 for id 3 ,  scope 2 for id 3 ,  scope 3 for id 3

4 Reference
Mr. J Mr. K Mr. L
scope 1 for id 4 ,  scope 2 for id 4 ,  scope 3 for id 4
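
The output above prints the id rather than the title, separates authors with spaces, and keeps the padding around each scope. If the layout from the question is wanted, or lxml is not installed, roughly the same approach seems to work with the standard library. A sketch, assuming the XML is held in a string (XML_TEXT is a shortened, made-up sample in the question's shape) and using a hypothetical helper name render_reference. ElementTree's findall only supports a limited XPath subset, so the name attribute is read with .get() instead of an @name path.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's file (assumption).
XML_TEXT = """
<uniprot created="2010-12-20">
  <entry dataset="abc">
    <references id="1">
      <title>first references</title>
      <author>
        <person name="Mr. A"/>
        <person name="Mr. B"/>
      </author>
      <scope> scope 1 for id 1 </scope>
      <scope> scope 2 for id 1 </scope>
    </references>
    <references id="2">
      <title>Second references</title>
      <author>
        <person name="Mr. D"/>
      </author>
      <scope> scope 1 for id 2 </scope>
    </references>
  </entry>
</uniprot>
"""


def render_reference(ref_node):
    """Format one <references> element as title, authors, scopes."""
    title = ref_node.findtext('title', default='').strip()
    # Paths are relative to ref_node, so only this entry's people are found.
    authors = ', '.join(p.get('name', '') for p in ref_node.findall('author/person'))
    # Strip the whitespace padding around each scope's text.
    scopes = ', '.join((s.text or '').strip() for s in ref_node.findall('scope'))
    return '\n'.join([title, authors, scopes])


root = ET.fromstring(XML_TEXT.strip())
rendered = [render_reference(ref) for ref in root.iter('references')]
print('\n\n'.join(rendered))
```

The key point is the same as in the lxml version: every lookup starts from the individual references node, not from the document root, so each entry only sees its own authors and scopes.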

Retrieving XML data in Python
