I am new to R. I have downloaded the XML of all BioProjects from NCBI. The file is 1 GB in size. This is where I started:
library(XML)
setwd("C:/Users/USER/Desktop/")
xmlfile = xmlParse("bioproject.xml")
root = xmlRoot(xmlfile)
xmlName(root)
[1] "PackageSet"
xmlSize(root)
[1] 357935
So there are 357935 projects in NCBI. Here I am looking at project 34:
> root[[34]]
<Package>
<Project>
<Project>
<ProjectID>
<ArchiveID accession="PRJNA44" archive="NCBI" id="44"/>
</ProjectID>
<ProjectDescr>
<Name>Bartonella quintana str. Toulouse</Name>
<Title>Causes bacillary angiomatosis</Title>
<Description><P><B><I>Bartonella quintana</I> str. Toulouse</B>. <I>Bartonella quintana</I> str. Toulouse was isolated from human blood in Toulouse, France in 1993. There is evidence of extensive genome reduction in comparison to other <I>Bartonella</I> species which may be associated with the limited host range of <I>Bartonella quintana</I>.</Description>
<ExternalLink category="Other Databases" label="GOLD">
<URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
</ExternalLink>
<Publication date="2004-06-24T00:00:00Z" id="15210978" status="ePublished">
<Reference/>
<DbType>ePubmed</DbType>
</Publication>
<ProjectReleaseDate>2004-06-25T00:00:00Z</ProjectReleaseDate>
<LocusTagPrefix assembly_id="GCA_000046685" biosample_id="SAMEA3138248">BQ</LocusTagPrefix>
</ProjectDescr>
<ProjectType>
...
...
</ProjectType>
</Project>
<Submission submitted="2003-03-20">
...
...
</Submission>
<ProjectLinks>
...
...
</ProjectLinks>
</Project>
</Package>
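I need to get the <ProjectID> value (in this case PRJNA44) for every project in the XML file, but only IF the text "isolated from human blood" (or just "blood", if that makes the script simpler) appears in the <Description> inside that project's <ProjectDescr>. Alternatively, if it is simpler, I could get the <URL> value inside <ExternalLink> in <ProjectDescr> instead of the ProjectID.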
I don't know how (or whether) to use the xpath functions (xpathApply, getNodeSet, or xpathSApply). Thanks for your help.
Answer 0: (score: 0)
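This is a fairly straightforward problem, and there are many examples of it on this site.
I find the syntax of the "xml2" package easier to use than that of the "XML" package.
In the example above, one Project node is a child of another node that is also named Project, which can cause problems if you try to select it by name alone. To find the correct nodes, I select only the Project nodes that are children of a Project node.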
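To see why the nesting matters, here is a minimal sketch on a toy document. The XML string is a hypothetical, trimmed-down copy of the structure in the question, not the real file, and the variable names are made up for illustration.

```r
library(xml2)

# Toy document mirroring the nesting in the question (trimmed, for illustration only)
toy <- read_xml('<PackageSet>
  <Package>
    <Project>
      <Project>
        <ProjectID><ArchiveID accession="PRJNA44" archive="NCBI" id="44"/></ProjectID>
      </Project>
      <Submission submitted="2003-03-20"/>
    </Project>
  </Package>
</PackageSet>')

# Matches every Project element, i.e. both the outer wrapper and the inner node
outer_and_inner <- xml_find_all(toy, ".//Project")

# Matches only Project elements whose parent is also a Project
inner_only <- xml_find_all(toy, ".//Project/Project")

length(outer_and_inner)
length(inner_only)
```

Selecting by ".//Project" alone would visit each project twice, once per level, so the parent/child path is the safer choice here.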
library(xml2)
library(dplyr)
# read the XML document
page <- read_xml("bioproject.xml")

# find all of the inner project nodes
projectnodes <- xml_find_all(page, ".//Project/Project")

# loop through all of the nodes and extract the requested information
dfs <- lapply(projectnodes, function(node) {
  # find the description text
  description <- xml_find_first(node, ".//Description") %>% xml_text()
  # find the URL link
  link <- xml_find_first(node, ".//URL") %>% xml_text()
  # find the project ID
  projid <- xml_find_first(node, ".//ArchiveID") %>% xml_attr("accession")
  # store the data in a one-row data frame
  data.frame(projid, link, description, stringsAsFactors = FALSE)
})

# bind all of the rows together into a single final data frame
answer <- bind_rows(dfs)

# find rows with the keyword using regular expressions
answer[grep("blood", answer$description), ]
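As an alternative, the keyword test can be pushed into the XPath expression itself, so only matching projects are ever extracted. This is a sketch under stated assumptions: the two-project document below is hypothetical (the second accession, "PRJ_TOY_2", is invented for the example), and the real file is assumed to follow the element layout shown in the question. Note that the XPath 1.0 function contains() is case-sensitive.

```r
library(xml2)

# Hypothetical two-project document following the layout shown in the question
doc <- read_xml('<PackageSet>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA44" archive="NCBI" id="44"/></ProjectID>
    <ProjectDescr>
      <Description>Isolated from human blood in Toulouse, France in 1993.</Description>
      <ExternalLink category="Other Databases" label="GOLD">
        <URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
      </ExternalLink>
    </ProjectDescr>
  </Project></Project></Package>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJ_TOY_2" archive="NCBI" id="2"/></ProjectID>
    <ProjectDescr>
      <Description>Isolated from soil.</Description>
    </ProjectDescr>
  </Project></Project></Package>
</PackageSet>')

# Keep only inner Project nodes whose Description contains the keyword
hits <- xml_find_all(
  doc,
  ".//Project/Project[ProjectDescr/Description[contains(., 'blood')]]"
)

# xml_find_first() on a node set returns one result per node, so these calls
# are vectorised and a project without a URL simply yields NA
ids  <- xml_attr(xml_find_first(hits, "./ProjectID/ArchiveID"), "accession")
urls <- xml_text(xml_find_first(hits, "./ProjectDescr/ExternalLink/URL"))

result <- data.frame(projid = ids, link = urls, stringsAsFactors = FALSE)
result
```

On a file with several hundred thousand projects this avoids building one data frame per node, which may be noticeably faster than the lapply() approach, although that has not been measured here.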