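PubMed ID to author list + citation, in Python?

Time: 2014-06-10 16:40:26

Tags: python biopython

I have a list of PubMed IDs, and for each one I want to extract a citation with the complete author list. There are online tools for this, such as http://mickschroeder.com/citation/, but they abbreviate the author list to "et al."

I am trying to do this with the Entrez package in Biopython, using xml.etree.ElementTree to parse the XML.

This is what I have: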

from Bio.Entrez import efetch
import xml.etree.ElementTree as ET

def fetch_abstract(pmid):
    handle = efetch(db='pubmed', id=pmid, retmode='xml')
    xml_data = handle.read()
    print(xml_data)  # this prints the XML data structure correctly

    article = ET.XML(xml_data)

    #problem starts here. I want to create a citation, so I start by trying to
    #get the names of the authors, but I am not sure why this is not working.
    for author_name in article.findall('AuthorValidYN'):
        print(author_name)

    return 


fetch_abstract(22864638)

The XML looks like this:

<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2014//EN"      "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_140101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">22864638</PMID>
    <DateCreated>
        <Year>2012</Year>
        <Month>10</Month>
        <Day>31</Day>
    </DateCreated>
    <DateCompleted>
        <Year>2013</Year>
        <Month>04</Month>
        <Day>23</Day>
    </DateCompleted>
    <Article PubModel="Print">
        <Journal>
            <ISSN IssnType="Electronic">1573-7292</ISSN>
            <JournalIssue CitedMedium="Internet">
                <Volume>11</Volume>
                <Issue>4</Issue>
                <PubDate>
                    <Year>2012</Year>
                    <Month>Dec</Month>
                </PubDate>
            </JournalIssue>
            <Title>Familial cancer</Title>
            <ISOAbbreviation>Fam. Cancer</ISOAbbreviation>
        </Journal>
        <ArticleTitle>No evidence for breast cancer susceptibility associated with variants of BRD7, a component of p53 and BRCA1 pathways.</ArticleTitle>
        <Pagination>
            <MedlinePgn>601-6</MedlinePgn>
        </Pagination>
        <ELocationID EIdType="doi" ValidYN="Y">10.1007/s10689-012-9556-0</ELocationID>
        <Abstract>
            <AbstractText>BRD7 (bromodomain 7), a subunit of poly-bromo-associated BRG1-associated factor (PBAF)-specific Swi/Snf chromatin remodeling complexes, has been proposed as a tumour suppressor protein following its identification as an important component of both functional p53 and BRCA1 (breast cancer 1, early onset) pathways. As low BRD7 expression levels have been linked to p53-wild-type breast tumour cells, we hypothesized an implication of BRD7 germline alterations in the pathogenesis of hereditary breast cancer similar to that of TP53 in Li-Fraumeni syndrome. We performed sequence analysis of the BRD7 gene on 61 high-risk individuals with hereditary or very-early-onset breast cancer and 100 healthy controls. Four potentially disease-causing single-nucleotide alterations were detected within the cohort of breast cancer patients (one listed as a rare single-nucleotide polymorphism (SNP) in the NCBI (National Center for Biotechnology Information) SNP database). Two of the detected variants were also each found once within the control collective. Segregation analysis on both families of those carrying the remaining two variants revealed segregation of these BRD7 alterations independent of breast cancer. In conclusion, it seems that the BRD7 variants we detected represent rare polymorphisms and mainly rule out BRD7 as a frequent high-penetrance breast cancer susceptibility gene. However, further analyses in larger cohorts of women with hereditary breast cancer should clarify the role of BRD7 in breast cancer predisposition.</AbstractText>
        </Abstract>
        <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
                <LastName>Penkert</LastName>
                <ForeName>Judith</ForeName>
                <Initials>J</Initials>
                <Affiliation>Institute of Cell and Molecular Pathology, Hannover Medical School, Carl-Neuberg-Strasse 1, Hannover, Germany.</Affiliation>
            </Author>
            <Author ValidYN="Y">
                <LastName>Schlegelberger</LastName>
                <ForeName>Brigitte</ForeName>
                <Initials>B</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Steinemann</LastName>
                <ForeName>Doris</ForeName>
                <Initials>D</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Gadzicki</LastName>
                <ForeName>Dorothea</ForeName>
                <Initials>D</Initials>
            </Author>
        </AuthorList>
        <Language>eng</Language>
        <PublicationTypeList>
            <PublicationType>Comparative Study</PublicationType>
            <PublicationType>Journal Article</PublicationType>
            <PublicationType>Research Support, Non-U.S. Gov't</PublicationType>
        </PublicationTypeList>
    </Article>
    <MedlineJournalInfo>
        <Country>Netherlands</Country>
        <MedlineTA>Fam Cancer</MedlineTA>
        <NlmUniqueID>100898211</NlmUniqueID>
        <ISSNLinking>1389-9600</ISSNLinking>
    </MedlineJournalInfo>
    <ChemicalList>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>BRCA1 Protein</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>BRCA1 protein, human</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>BRD7 protein, human</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Chromosomal Proteins, Non-Histone</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>TP53 protein, human</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Tumor Suppressor Protein p53</NameOfSubstance>
        </Chemical>
    </ChemicalList>
    <CitationSubset>IM</CitationSubset>
    <MeshHeadingList>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Adult</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Aged</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">BRCA1 Protein</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Breast Neoplasms</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Case-Control Studies</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Chromosomal Proteins, Non-Histone</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Female</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Genetic Predisposition to Disease</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Male</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Middle Aged</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Mutation</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Pedigree</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Polymorphism, Single Nucleotide</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Prognosis</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Tumor Suppressor Protein p53</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Young Adult</DescriptorName>
        </MeshHeading>
    </MeshHeadingList>
</MedlineCitation>
<PubmedData>
    <History>
        <PubMedPubDate PubStatus="entrez">
            <Year>2012</Year>
            <Month>8</Month>
            <Day>7</Day>
            <Hour>6</Hour>
            <Minute>0</Minute>
        </PubMedPubDate>
        <PubMedPubDate PubStatus="pubmed">
            <Year>2012</Year>
            <Month>8</Month>
            <Day>7</Day>
            <Hour>6</Hour>
            <Minute>0</Minute>
        </PubMedPubDate>
        <PubMedPubDate PubStatus="medline">
            <Year>2013</Year>
            <Month>4</Month>
            <Day>24</Day>
            <Hour>6</Hour>
            <Minute>0</Minute>
        </PubMedPubDate>
    </History>
    <PublicationStatus>ppublish</PublicationStatus>
    <ArticleIdList>
        <ArticleId IdType="doi">10.1007/s10689-012-9556-0</ArticleId>
        <ArticleId IdType="pubmed">22864638</ArticleId>
    </ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>

2 answers:

Answer 0 (score: 2)

Here is what I used when I did the same thing before (using BeautifulSoup instead).

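from bs4 import BeautifulSoup  # BeautifulSoup 4 (bs4) in place of the old BeautifulSoup 3 import

# html.parser lowercases tag names, which is why the lookups below are lowercase
soup = BeautifulSoup(xml_data, "html.parser")

def text_or_none(tag):
    # Illustrative stand-in for the helper described below: a missing tag yields None
    return tag.text if tag is not None else None

a_recs = []

for tag in soup.find_all("pubmedarticle"):  # I'm working with multiple articles in one file
    for a_tag in tag.find_all("author"):
        a_rec = {}
        a_rec['pmid'] = int(tag.pmid.text)
        a_rec['lastname'] = text_or_none(a_tag.lastname)
        a_rec['forename'] = text_or_none(a_tag.forename)
        a_rec['suffix'] = text_or_none(a_tag.suffix)
        a_rec['initials'] = text_or_none(a_tag.initials)
        a_rec['affiliation'] = text_or_none(a_tag.affiliation)
        a_recs.append(a_rec)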

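Quite often some parts of an author's name are missing, and accessing the text attribute of a missing tag raises an error, so you need to check that the tag exists before reading its text. I wrote a short function that returns None by default when a tag has no text.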
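As a rough sketch of that None-defaulting idea, here is a version that uses only the standard library's xml.etree.ElementTree rather than BeautifulSoup (a swapped-in parser, not the code this answer was originally written with). The names author_records, FIELDS and the trimmed SAMPLE_XML are made up for illustration. Element.findtext() returns None when the child element is absent, so no separate existence check is needed.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the shape of the efetch output shown in the question.
SAMPLE_XML = """<PubmedArticleSet>
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">22864638</PMID>
    <Article PubModel="Print">
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y">
          <LastName>Penkert</LastName>
          <ForeName>Judith</ForeName>
          <Initials>J</Initials>
          <Affiliation>Institute of Cell and Molecular Pathology, Hannover Medical School, Carl-Neuberg-Strasse 1, Hannover, Germany.</Affiliation>
        </Author>
        <Author ValidYN="Y">
          <LastName>Schlegelberger</LastName>
          <ForeName>Brigitte</ForeName>
          <Initials>B</Initials>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>"""

# Child elements to copy from each Author (illustrative selection).
FIELDS = ("LastName", "ForeName", "Suffix", "Initials", "Affiliation")


def author_records(xml_data):
    """Return one dict per author; fields absent from the XML are None."""
    root = ET.fromstring(xml_data)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = int(article.findtext("MedlineCitation/PMID"))
        for author in article.iter("Author"):
            record = {"pmid": pmid}
            for field in FIELDS:
                # findtext() returns None when the child element is missing
                record[field.lower()] = author.findtext(field)
            records.append(record)
    return records


records = author_records(SAMPLE_XML)
for record in records:
    print(record["lastname"], record["suffix"], record["affiliation"])
```

In this sample neither author has a Suffix element and the second has no Affiliation, so those entries come back as None instead of raising an error.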

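Answer 1 (score: 1)

I think you are looking for the wrong XML node. ValidYN is an attribute of the Author node, so you should use:

for author_name in article.findall('Author'):

"Element.findall() finds only elements with a tag which are direct children of the current element." I think you need to make AuthorList the current element. Note that find() has the same direct-children restriction, and here article is the PubmedArticleSet root, so the path needs a .// prefix to search the whole tree. Something like this:

article.find('.//AuthorList').findall('Author')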
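To illustrate the direct-children behaviour and get as far as the citation the question asks for, here is a minimal sketch using only xml.etree.ElementTree. The function name format_citation, the trimmed SAMPLE_XML, and the citation layout (authors, title, abbreviated journal, year;volume(issue):pages) are my own assumptions rather than anything PubMed prescribes, and the sketch assumes every field it reads is present in the record.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the shape of the efetch output shown in the question.
SAMPLE_XML = """<PubmedArticleSet>
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">22864638</PMID>
    <Article PubModel="Print">
      <Journal>
        <JournalIssue CitedMedium="Internet">
          <Volume>11</Volume>
          <Issue>4</Issue>
          <PubDate><Year>2012</Year><Month>Dec</Month></PubDate>
        </JournalIssue>
        <Title>Familial cancer</Title>
        <ISOAbbreviation>Fam. Cancer</ISOAbbreviation>
      </Journal>
      <ArticleTitle>No evidence for breast cancer susceptibility associated with variants of BRD7, a component of p53 and BRCA1 pathways.</ArticleTitle>
      <Pagination><MedlinePgn>601-6</MedlinePgn></Pagination>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y"><LastName>Penkert</LastName><ForeName>Judith</ForeName><Initials>J</Initials></Author>
        <Author ValidYN="Y"><LastName>Schlegelberger</LastName><ForeName>Brigitte</ForeName><Initials>B</Initials></Author>
        <Author ValidYN="Y"><LastName>Steinemann</LastName><ForeName>Doris</ForeName><Initials>D</Initials></Author>
        <Author ValidYN="Y"><LastName>Gadzicki</LastName><ForeName>Dorothea</ForeName><Initials>D</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>"""


def format_citation(article):
    """Build a citation string with the full author list for one PubmedArticle element."""
    names = []
    for author in article.iter("Author"):
        last = author.findtext("LastName")
        initials = author.findtext("Initials")
        if last:  # skip entries that have no personal name
            names.append(" ".join(part for part in (last, initials) if part))
    authors = ", ".join(names)

    art = article.find("MedlineCitation/Article")
    title = (art.findtext("ArticleTitle") or "").rstrip(".")
    journal = art.findtext("Journal/ISOAbbreviation")
    issue = art.find("Journal/JournalIssue")
    year = issue.findtext("PubDate/Year")
    volume = issue.findtext("Volume")
    number = issue.findtext("Issue")
    pages = art.findtext("Pagination/MedlinePgn")
    return f"{authors}. {title}. {journal} {year};{volume}({number}):{pages}."


root = ET.fromstring(SAMPLE_XML)

# Author is not a direct child of the root, so a bare tag name matches nothing.
direct = root.findall("Author")
# The .// prefix searches all descendants instead.
nested = root.findall(".//Author")

citations = [format_citation(article) for article in root.iter("PubmedArticle")]
print(len(direct), len(nested))
print(citations[0])
```

Because the loop runs over every PubmedArticle element, the same function should also work when several PMIDs are fetched into one PubmedArticleSet.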