Using XPath to get values from a large NCBI XML file

Time: 2019-03-18 17:02:25

Tags: r xml xpath genome ncbi

I am new to R. I have downloaded the XML of all BioProjects from NCBI. The file is 1 GB in size. I started with this:

library(XML)
setwd("C://Users/USER/Desktop/")
xmlfile = xmlParse("bioproject.xml")
root = xmlRoot(xmlfile)
xmlName(root)
[1] "PackageSet"
xmlSize(root)
[1] 357935

So there are 357935 projects in NCBI. Here I am looking at project 34:

> root[[34]]
<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA44" archive="NCBI" id="44"/>
      </ProjectID>
      <ProjectDescr>
        <Name>Bartonella quintana str. Toulouse</Name>
        <Title>Causes bacillary angiomatosis</Title>
        <Description>&lt;P&gt;&lt;B&gt;&lt;I&gt;Bartonella quintana&lt;/I&gt; str. Toulouse&lt;/B&gt;. &lt;I&gt;Bartonella quintana&lt;/I&gt; str. Toulouse was isolated from human blood in Toulouse, France in 1993. There is evidence of extensive genome reduction in comparison to other &lt;I&gt;Bartonella&lt;/I&gt; species which may be associated with the limited host range of &lt;I&gt;Bartonella quintana&lt;/I&gt;.</Description>
        <ExternalLink category="Other Databases" label="GOLD">
          <URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
        </ExternalLink>
        <Publication date="2004-06-24T00:00:00Z" id="15210978" status="ePublished">
          <Reference/>
          <DbType>ePubmed</DbType>
        </Publication>
        <ProjectReleaseDate>2004-06-25T00:00:00Z</ProjectReleaseDate>
        <LocusTagPrefix assembly_id="GCA_000046685" biosample_id="SAMEA3138248">BQ</LocusTagPrefix>
      </ProjectDescr>   
      <ProjectType>
        ...
        ...
      </ProjectType>
    </Project>
    <Submission submitted="2003-03-20">
      ...
      ...
    </Submission>
    <ProjectLinks>
      ...
      ...
    </ProjectLinks>
  </Project>
</Package>

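I need to get all of the <ProjectID> values (in this case PRJNA44) in the entire XML file, but only IF the text "isolated from human blood" (or just "blood", if that makes the script simpler) is present in the <Description> inside the <ProjectDescr> of each project. Alternatively, if it is simpler, it would be fine to get the <URL> value from the <ExternalLink> inside <ProjectDescr> instead of the ProjectID.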

I don't know how (or whether) to use the XPath functions (xpathApply, getNodeSet, or xpathSApply). Thank you for your help.

1 answer:

Answer 0: (score: 0)

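This is a fairly straightforward problem, and there are many examples of it on this site.
I find the syntax of the "xml2" package easier to use than that of the "XML" package.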

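In your example above, each Project node is a child of another node that is also named Project, which can cause problems when you try to select it. To get the correct nodes, I select only the Project nodes that are children of a Project node.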
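To illustrate the nesting issue, here is a minimal sketch on a toy document. The inline XML is an assumption: a trimmed-down stand-in for the structure shown in the question, not the real file.

```r
library(xml2)

# Toy stand-in for bioproject.xml (assumed structure, trimmed from the question)
doc <- read_xml('<PackageSet>
  <Package>
    <Project>
      <Project>
        <ProjectID>
          <ArchiveID accession="PRJNA44" archive="NCBI" id="44"/>
        </ProjectID>
      </Project>
    </Project>
  </Package>
</PackageSet>')

# ".//Project" matches the outer wrapper as well as the inner node
all_projects <- xml_find_all(doc, ".//Project")

# ".//Project/Project" matches only a Project whose parent is also a Project
inner_projects <- xml_find_all(doc, ".//Project/Project")

length(all_projects)
length(inner_projects)
```

On this toy document the first query returns two nodes and the second returns one, which is why the narrower path is the safer starting point.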

library(xml2)
library(dplyr)

#read xml document
page<-read_xml("bioproject.xml")

#find all of the project nodes
projectnodes<-xml_find_all(page, ".//Project/Project")

#loop through all of the nodes and extract the requested information
dfs<-lapply(projectnodes, function(node) {
   #find description text
   description<-xml_find_first(node, ".//Description") %>% xml_text()
   #find the URL link
   link<-xml_find_first(node, ".//URL") %>% xml_text()
   #find project ID (the accession attribute of ArchiveID)
   projid<-xml_find_first(node, ".//ArchiveID") %>% xml_attr("accession")
   #return one single-row data frame per project
   data.frame(projid, link, description, stringsAsFactors = FALSE)
})


#bind all of the rows together into a single final data frame
answer<-bind_rows(dfs)

#find rows with the keyword using regular expressions.
answer[grep("blood", answer$description),]
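Since the question asks about XPath specifically, the filter can probably also be pushed into the XPath expression itself, which avoids building one data frame per project. The following is only a sketch on a toy document: the inline XML and the accession "PRJ_EXAMPLE" are made up for illustration, and I have not timed this against the real 1 GB file.

```r
library(xml2)

# Toy document assumed to mirror the structure in the question.
# "PRJ_EXAMPLE" is a made-up accession used only for this illustration.
doc <- read_xml('<PackageSet>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA44" archive="NCBI" id="44"/></ProjectID>
    <ProjectDescr>
      <Description>was isolated from human blood in Toulouse</Description>
    </ProjectDescr>
  </Project></Project></Package>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJ_EXAMPLE" archive="NCBI" id="0"/></ProjectID>
    <ProjectDescr>
      <Description>isolated from soil</Description>
    </ProjectDescr>
  </Project></Project></Package>
</PackageSet>')

# Keep only the inner Project nodes whose Description contains "blood",
# then step down to their ArchiveID element
hits <- xml_find_all(
  doc,
  ".//Project/Project[ProjectDescr/Description[contains(., 'blood')]]/ProjectID/ArchiveID"
)

# The project ID is stored in the accession attribute
accessions <- xml_attr(hits, "accession")
accessions
```

Note that the XPath 1.0 contains() function is case-sensitive and matches substrings, so 'blood' would also match "bloodstream" but not "Blood". Replacing 'blood' with 'isolated from human blood' narrows the match to the exact phrase from the question.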