Extracting data from XML tags using Python

Date: 2019-06-03 10:55:09

Tags: python xml

I want to use Python to extract some IDs (doi, pmcid and pmid) from the record tag of an .xml file:

The XML file:

<pmcids status="ok">
    <request idtype="doi" dois="" versions="yes" showaiid="no">
        <warning>no e-mail provided</warning>
        <warning>no tool provided</warning>
        <echo>ids=10.1371%2Fjournal.pone.0054577</echo>
    </request>
    <record requested-id="10.1371/JOURNAL.PONE.0054577"     pmcid="PMC3557238" pmid="23382917" doi="10.1371/journal.pone.0054577">
        <versions><version pmcid="PMC3557238.1" current="true"/>
        </versions>
    </record>
</pmcids>

I tried the following Python code:

import xml.etree.cElementTree as etree

xmlDoc = open('garbage_collector/tmp.xml', 'r')
xmlDocData = xmlDoc.read()
xmlDocTree = etree.XML(xmlDocData)

for ingredient in xmlDocTree.iter('record'):
    print ingredient[0].text

I want to output the pmcid, doi and pmid as strings.

1 answer:

Answer 0: (score: 0)

If you can use BeautifulSoup, you can do

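from bs4 import BeautifulSoup

# Name the parser explicitly; without it bs4 guesses one and emits a warning.
soup = BeautifulSoup(input_xml, 'html.parser')
t = soup.find('record')  # first <record> tag in the document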

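where input_xml is the XML to inspect, as a string.

We use the find() function to locate the record tag and store it in the variable t. The attributes of the <record> tag can then be accessed by indexing t.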

print(t['pmcid'])
print(t['doi'])
print(t['pmid'])

which prints

PMC3557238
10.1371/journal.pone.0054577
23382917
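
The same attributes can probably also be read with the standard library alone. The attempt in the question prints ingredient[0].text, which is the text of the first child element (<versions>), whereas the IDs are attributes of <record> itself. Below is a minimal sketch, assuming the XML from the question is available as a string; the names xml_text and extract_ids are illustrative and not part of the original post.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
xml_text = """<pmcids status="ok">
    <request idtype="doi" dois="" versions="yes" showaiid="no">
        <warning>no e-mail provided</warning>
        <warning>no tool provided</warning>
        <echo>ids=10.1371%2Fjournal.pone.0054577</echo>
    </request>
    <record requested-id="10.1371/JOURNAL.PONE.0054577" pmcid="PMC3557238" pmid="23382917" doi="10.1371/journal.pone.0054577">
        <versions><version pmcid="PMC3557238.1" current="true"/>
        </versions>
    </record>
</pmcids>"""


def extract_ids(xml_string):
    """Return one dict of pmcid/doi/pmid per <record> element (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    results = []
    for record in root.iter('record'):
        # The IDs are attributes of <record>, not child text, so use .get();
        # it returns None when an attribute is missing.
        results.append({key: record.get(key) for key in ('pmcid', 'doi', 'pmid')})
    return results


ids = extract_ids(xml_text)
for entry in ids:
    print(entry['pmcid'], entry['doi'], entry['pmid'])
```

Looping with iter('record') also covers responses that contain more than one record, and .get() avoids a KeyError when a record lacks one of the IDs.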