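I am parsing a directory of XML files downloaded from ClinicalTrials.gov, but I am having trouble extracting the data. I can do it for a single file (NCT00006435.xml below), but I cannot figure out how to do it for multiple files.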
library(XML)
# Download ct.gov query and extract xml files
ct<-tempfile()
dir.create("ctdir")
url<-"https://clinicaltrials.gov/search?term=neurofibromatosis-type-1&studyxml=true"
download.file(url, ct)
unzip(ct, exdir="ctdir")
files<-list.files("ctdir")
# Change the working directory so we don't have to worry about the filepath
setwd("ctdir")
# Extract data from one file and get it into a data frame
#xmlfile<-xmlTreeParse("NCT00006435.xml")
#xmltop<-xmlRoot(xmlfile)
#tags<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
#tags_df<-data.frame(t(tags),row.names=NULL)
# Extract data from each file and get it into a data frame
xmlfiles<-lapply(files,function(x) xmlTreeParse(x))
xmltop<-lapply(xmlfiles,function(x) xmlRoot(x))
tags<-???
How do I go through the list of files and loop over every tag in each file?
Answer 0 (score: 0)
The top of str(xmltop) looks like:
List of 107
$ :List of 40
..$ comment : Named list()
.. ..- attr(*, "class")= chr [1:5] "XMLCommentNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
..$ required_header :List of 3
.. ..$ download_date:List of 1
.. .. ..$ text: Named list()
.. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
.. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
.. ..$ link_text :List of 1
.. .. ..$ text: Named list()
.. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
.. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
.. ..$ url :List of 1
.. .. ..$ text: Named list()
.. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
.. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
.. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
..$ id_info :List of 4
.. ..$ org_study_id:List of 1
.. .. ..$ text: Named list()
.. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
.. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
So it is a list, and you can "loop" over its top level with a simple lapply. If you want to use your single-node case code, that would be:
tags<-lapply(xmltop, function(x) xmlSApply(x, xmlValue))
object.size(tags)
1618008 bytes
That is still a rather unwieldy object. I repeat my suggestion that you find a more manageable example.
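As a more manageable illustration of the same lapply pattern, here is a minimal sketch that parses two tiny in-memory documents instead of the 107 downloaded files. The XML strings and the names docs and roots are made up for this example; only the shape of the calls mirrors the code above.

```r
library(XML)

# Two made-up documents standing in for the downloaded study files
docs <- c(
  "<clinical_study><id_info><nct_id>NCT00000001</nct_id></id_info><brief_title>Study A</brief_title></clinical_study>",
  "<clinical_study><id_info><nct_id>NCT00000002</nct_id></id_info><brief_title>Study B</brief_title></clinical_study>"
)

# Parse each string and keep only its root node (one list element per document)
roots <- lapply(docs, function(d) xmlRoot(xmlTreeParse(d, asText = TRUE)))

# For each root, collect the text of every top-level child
tags <- lapply(roots, function(r) xmlSApply(r, xmlValue))

str(tags)
```

Each element of tags is a named character vector, one entry per top-level tag, which is much easier to inspect than the full node tree.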
Answer 1 (score: 0)
Wrap the code in a function.
tags_df <- function(file) {
  message("Loading ", file)
  # your code
  xmlfile <- xmlTreeParse(file)
  xmltop <- xmlRoot(xmlfile)
  tags_l <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  tags <- data.frame(t(tags_l), row.names = NULL)
  tags
}
tags <- lapply(files, tags_df)
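The same idea can be tried end to end without changing the working directory. This is only a sketch under stated assumptions: the two files written to a scratch directory are made up, read_study is a hypothetical simplified helper (it takes only the text of each top-level tag), and list.files is called with full.names = TRUE so that setwd is not needed.

```r
library(XML)

# Write two made-up study files into a scratch directory
xml_dir <- file.path(tempdir(), "ct_sketch")
dir.create(xml_dir, showWarnings = FALSE)
writeLines(
  "<clinical_study><brief_title>Study A</brief_title><phase>Phase 1</phase></clinical_study>",
  file.path(xml_dir, "A.xml")
)
writeLines(
  "<clinical_study><brief_title>Study B</brief_title><phase>Phase 2</phase></clinical_study>",
  file.path(xml_dir, "B.xml")
)

# Full paths, restricted to .xml files, so no setwd() is required
paths <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)

# Hypothetical helper: one file in, one single-row data frame out
read_study <- function(path) {
  root <- xmlRoot(xmlTreeParse(path))
  vals <- xmlSApply(root, xmlValue)  # named character vector, one entry per tag
  data.frame(t(vals), stringsAsFactors = FALSE)
}

studies <- lapply(paths, read_study)
names(studies) <- basename(paths)
```

Naming the list by file makes it easy to trace any row back to the study it came from.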
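Because you have one-to-many locations, keywords, and other tags, combining the data.frames produces 260 columns, including location.1 through location.120. I would replace your code with a few specific XPath queries that pull just the tags you really want into an easy-to-read format.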
# ldply() comes from the plyr package
library(plyr)
x <- ldply(tags, "data.frame")
names(x)
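A rough sketch of the XPath approach follows. The element names (nct_id, brief_title, location/facility/name) are assumptions about the export format and should be checked against a real file; extract_study and the sample record are hypothetical and exist only for this illustration. Repeated tags such as location are collapsed into a single column instead of being spread across location.1, location.2, and so on.

```r
library(XML)

# Made-up study record; the element names are assumed, not taken from a real file
study_xml <- "<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <brief_title>Example study</brief_title>
  <location><facility><name>Site A</name></facility></location>
  <location><facility><name>Site B</name></facility></location>
</clinical_study>"

# Hypothetical helper: pull a few chosen tags into a single-row data frame
extract_study <- function(x, as_text = FALSE) {
  doc <- xmlParse(x, asText = as_text)
  # First match of an XPath query, or NA when the tag is absent
  first <- function(path) {
    v <- xpathSApply(doc, path, xmlValue)
    if (length(v) == 0) NA_character_ else v[[1]]
  }
  sites <- xpathSApply(doc, "//location/facility/name", xmlValue)
  data.frame(
    nct_id = first("//id_info/nct_id"),
    title = first("//brief_title"),
    n_locations = length(getNodeSet(doc, "//location")),
    locations = paste(sites, collapse = "; "),
    stringsAsFactors = FALSE
  )
}

one <- extract_study(study_xml, as_text = TRUE)
```

For a directory of files, something like do.call(rbind, lapply(files, extract_study)) would then give one row per study with the same columns in every row.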