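I want to read the PMID and the authors' last names from an XML file (sample file shown below).
I am getting the PMIDs and the last names, but the author loop runs once for every PMID. I want each PMID listed once, together with its own authors' names.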
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE PubmedArticleSet SYSTEM "http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">2844048</PMID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Guarner</LastName>
<ForeName>J</ForeName>
<Initials>J</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Cohen</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Mushi</LastName>
<ForeName>E</ForeName>
<Initials>F</Initials>
</Author>
</AuthorList>
</MedlineCitation>
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">123456</PMID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Jones</LastName>
<ForeName>E</ForeName>
<Initials>F</Initials>
</Author>
</AuthorList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
Code I have tried:
FN=[]
for pmid in root.iter('PMID'):
    print(pmid.text)
    for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
        for f in id.findall("./Author/ForeName"):
            fn=f.text
            x= '{},{}'.format(i, fn)
            #print(x)
            FN.append(x)
Expected output:
PMID AUTHORS
2844048 'Guarner J J', 'Cohen C C'
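Answer 0 (score: 0)
I don't know whether you want the output in a specific format, but you can try the code below. The output is a dictionary whose keys are PMIDs and whose values are lists of the authors' last names.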
import xml.etree.ElementTree as ET

tree = ET.parse(r'E:\Python\DataFiles\PMID.xml')  # change according to your location
root = tree.getroot()
authors_by_pmid = {}
for amedlinecitation in root.iter('MedlineCitation'):  # PMID and Author are both inside MedlineCitation
    pmid = amedlinecitation.find('PMID').text
    # for each MedlineCitation, collect the LastName of every Author under it
    authors_by_pmid[pmid] = [anauthor.find('LastName').text
                             for anauthor in amedlinecitation.iter('Author')]
print(authors_by_pmid)
Output:
{'2844048': ['Guarner', 'Cohen', 'Mushi'], '123456': ['Smith', 'Jones']}
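If you want each name in the 'LastName ForeName Initials' shape from the question's expected output, a minimal sketch along the same lines could look like this. It parses a shortened inline copy of the sample (an assumption, so that it runs without a file on disk), and `full_names_by_pmid` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the question's sample (DOCTYPE omitted).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2844048</PMID>
      <AuthorList>
        <Author><LastName>Guarner</LastName><ForeName>J</ForeName><Initials>J</Initials></Author>
        <Author><LastName>Cohen</LastName><ForeName>C</ForeName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">123456</PMID>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>C</ForeName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def full_names_by_pmid(root):
    """Hypothetical helper: map each PMID to 'LastName ForeName Initials' strings."""
    result = {}
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")
        names = []
        for author in citation.iter("Author"):
            # findtext returns the default ("") when a child element is missing,
            # so an Author without a LastName does not raise AttributeError.
            parts = [author.findtext(tag, default="")
                     for tag in ("LastName", "ForeName", "Initials")]
            names.append(" ".join(p for p in parts if p))
        result[pmid] = names
    return result


root = ET.fromstring(SAMPLE)
names = full_names_by_pmid(root)
for pmid, authors in names.items():
    # One line per PMID, names quoted and comma-separated
    print(pmid, ", ".join(repr(a) for a in authors))
```

The key point is that the author loop runs on each MedlineCitation element rather than on root, so every PMID only sees its own authors.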
The following code gives the output in tabular form, using a pandas DataFrame.
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse(r'E:\Python\DataFiles\PMID.xml')  # change according to your location
all_authors_pmid = []
root = tree.getroot()
for amedlinecitation in root.iter('MedlineCitation'):  # PMID and Author are both inside MedlineCitation
    pmid = amedlinecitation.find('PMID').text
    for anauthor in amedlinecitation.iter('Author'):  # for each MedlineCitation, find all its Authors
        author_name = anauthor.find('LastName').text  # for each Author, find the LastName tag and extract its value
        all_authors_pmid.append([pmid, author_name])  # one row per (PMID, Author) pair
df = pd.DataFrame(all_authors_pmid, columns=['PMID', 'Author'])
print(df)
Output:
      PMID   Author
0  2844048  Guarner
1  2844048    Cohen
2  2844048    Mushi
3   123456    Smith
4   123456    Jones
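If you would rather have one row per PMID, as in the question's expected output, the long table can be collapsed with a groupby. This is only a sketch: the rows are typed in by hand to mirror the output above, and the `Authors` column name is my own choice.

```python
import pandas as pd

# Long-format rows like the table above: one (PMID, Author) pair per row.
rows = [
    ["2844048", "Guarner"],
    ["2844048", "Cohen"],
    ["2844048", "Mushi"],
    ["123456", "Smith"],
    ["123456", "Jones"],
]
df = pd.DataFrame(rows, columns=["PMID", "Author"])

# sort=False keeps PMIDs in order of first appearance instead of sorting them.
grouped = (
    df.groupby("PMID", sort=False)["Author"]
      .agg(", ".join)                 # join each PMID's authors into one string
      .reset_index(name="Authors")    # back to a two-column DataFrame
)
print(grouped)
```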
How the DataFrame code above differs from the first snippet: