Question

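I'm parsing a directory of XML files downloaded from ClinicalTrials.gov, but I'm having trouble extracting the data. I can do it for a single file (NCT00006435.xml below), but I can't figure out how to do it for multiple files.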

library(XML)
# Download ct.gov query and extract xml files
ct<-tempfile()
dir.create("ctdir")
url<-"https://clinicaltrials.gov/search?term=neurofibromatosis-type-1&studyxml=true"
download.file(url, ct, mode="wb")  # binary mode keeps the zip intact on Windows
unzip(ct, exdir="ctdir")
files<-list.files("ctdir")
# Change the working directory so we don't have to worry about the filepath
setwd("ctdir")

# Extract data from one file and get it into a data frame
#xmlfile<-xmlTreeParse("NCT00006435.xml")
#xmltop<-xmlRoot(xmlfile)
#tags<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
#tags_df<-data.frame(t(tags),row.names=NULL)

# Extract data from each file and get it into a data frame
xmlfiles<-lapply(files,function(x) xmlTreeParse(x))
xmltop<-lapply(xmlfiles,function(x) xmlRoot(x))
tags<-???

How do I go through the list of files and loop over every tag in each file?

Answer 1

The top of str(xmltop) looks like:

List of 107
 $ :List of 40
  ..$ comment             : Named list()
  .. ..- attr(*, "class")= chr [1:5] "XMLCommentNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
  ..$ required_header     :List of 3
  .. ..$ download_date:List of 1
  .. .. ..$ text: Named list()
  .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
  .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
  .. ..$ link_text    :List of 1
  .. .. ..$ text: Named list()
  .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
  .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
  .. ..$ url          :List of 1
  .. .. ..$ text: Named list()
  .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
  .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
  .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
  ..$ id_info             :List of 4
  .. ..$ org_study_id:List of 1
  .. .. ..$ text: Named list()
  .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ...
  .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"

So it is a list, and you can "loop" over its top level with a simple lapply. If you want to reuse your single-node code, that would be:

tags<-lapply(xmltop, function(x) xmlSApply(x, xmlValue))
object.size(tags)
1618008 bytes

That is still a rather unwieldy object. I reiterate my suggestion that you find a more manageable example.
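
For what it's worth, a more manageable example can be as small as two made-up documents parsed from strings. The sketch below is only an illustration: the element names study, id and title are invented and do not come from ClinicalTrials.gov. It shows that the same lapply call returns one named vector per document.

```r
library(XML)

# Two tiny, invented documents standing in for the downloaded files.
docs <- c(
  "<study><id>A</id><title>First</title></study>",
  "<study><id>B</id><title>Second</title></study>"
)

# Parse each string and take its root node, as in the question.
xmlfiles <- lapply(docs, function(x) xmlTreeParse(x, asText = TRUE))
xmltop <- lapply(xmlfiles, xmlRoot)

# One list element per document, each holding the text of the top-level tags.
tags <- lapply(xmltop, function(x) xmlSApply(x, xmlValue))
str(tags)
```

Once the structure is clear on a toy case like this, the same call can be pointed at the real list of parsed files.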

Answer 2

Wrap your code in a function.

tags_df <- function(file){
  message("Loading ", file)
  #your code 
  xmlfile<-xmlTreeParse(file)
  xmltop<-xmlRoot(xmlfile)
  tags_l<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  tags<-data.frame(t(tags_l),row.names=NULL)
  tags
}

tags<- lapply(files, tags_df)

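Since you have one-to-many locations, keywords, and other tags, combining the data.frames produces 260 columns, including location.1 through location.120. I would replace your code with a few specific XPath queries to get the tags you really want into a readable format.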

library(plyr)  # ldply comes from the plyr package
x <- ldply(tags, "data.frame")
names(x)
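
As a rough sketch of the XPath idea, something like the following might work. The helper name extract_study is hypothetical, and the element names nct_id, brief_title, keyword and location are assumptions about the study XML layout, so check them against one of your own files first. The demo document is invented so that the snippet runs on its own.

```r
library(XML)

# Hypothetical helper: pull a few chosen fields out of one study file.
extract_study <- function(file) {
  doc <- xmlParse(file)  # internal document, which XPath queries require

  # Return the first match for an XPath expression, or NA if there is none.
  first <- function(path) {
    v <- xpathSApply(doc, path, xmlValue)
    if (length(v) == 0) NA_character_ else v[[1]]
  }

  data.frame(
    nct_id      = first("//id_info/nct_id"),   # assumed element name
    title       = first("//brief_title"),      # assumed element name
    # Collapse the one-to-many keywords into a single string.
    keywords    = paste(xpathSApply(doc, "//keyword", xmlValue), collapse = "; "),
    # Count the locations instead of spreading them over many columns.
    n_locations = length(getNodeSet(doc, "//location")),
    stringsAsFactors = FALSE
  )
}

# Invented demo document written to a temporary file.
demo <- tempfile(fileext = ".xml")
writeLines(paste0(
  "<clinical_study><id_info><nct_id>NCT00000000</nct_id></id_info>",
  "<brief_title>Demo trial</brief_title>",
  "<keyword>alpha</keyword><keyword>beta</keyword>",
  "<location/><location/></clinical_study>"
), demo)

res <- extract_study(demo)
print(res)
```

With the real download, do.call(rbind, lapply(files, extract_study)) should then give one row per trial, with the repeated tags collapsed rather than spread over hundreds of columns.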
