Question

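I found this question and hrbrmstr's answer: "In R, how to extracting two values from XML file, looping over 5603 files and write to table" ... It works with the original data set, for example, but with my own data set I get an error: Error in ans[[1]] : subscript out of bounds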

library(XML)   # xmlTreeParse, xmlValue
library(plyr)  # ldply

setwd("LOCATION_OF_XML_FILES")

# pattern is a regular expression, not a glob
xmlfiles <- list.files(pattern = "\\.xml$")

dat <- ldply(seq(xmlfiles), function(i){
  doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
  teksti <- xmlValue(doc[["//body"]])  # the line that raises the error
  file <- unlist(strsplit(xmlfiles[i],split=".",fixed=T))[1]
  return(data.frame(file,teksti)) 
})

head(dat)

write.csv(dat, "tekstit_xml.csv", row.names=FALSE)

My data set is confidential, so I'm afraid I can't share it, but the structure looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<article> <body> flajslkfjlkjaslkjflkajlskjfasjdfjflkdsjalfjdsj 
"a lot of text, like a chapter of a book"
 </body> </article>

If I remove the line teksti <- xmlValue(doc[["//body"]]), the code works, but when it is included I get the error:

Error in ans[[1]] : subscript out of bounds

Can you help me?

Edit: I tried it with 11 files and everything went fine. But with 530 XML files it still fails. The largest file has about 5,000 words. So does data.frame have a size limit?

Traceback:

 Error in ans[[1]] : subscript out of bounds 

 8 `[[.XMLInternalDocument`(doc, "//body") 

 7 doc[["//body"]] 

 6 xmlValue(doc[["//body"]]) 

 5 FUN(X[[12L]], ...) 

 4 lapply(pieces, .fun, ...) 

 3 structure(lapply(pieces, .fun, ...), dim = dim(pieces)) 

 2 llply(.data = .data, .fun = .fun, ..., .progress = .progress, 
 .inform = .inform, .parallel = .parallel, .paropts = .paropts) 

 1 ldply(seq(xmlfiles), function(i) {
   doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
   teksti <- xmlValue(doc[["//body"]])
   file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = T))[1] ...

Answer 1

One of your files is missing the "body" tag. Asking for a tag that does not exist reproduces the error:

xmlValue(doc[["//bodyy"]])
Error in ans[[1]] : subscript out of bounds

You can use xpathSApply instead, which returns an empty list when the tag is missing:

xpathSApply(doc, "//bodyy", xmlValue)
list()
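
The difference can be reproduced without any files on disk. This is only a minimal sketch, assuming the XML package is installed; the two in-memory documents and the variable names are made up for illustration:

```r
library(XML)

# Two made-up documents: one with a <body> node, one without.
with_body    <- xmlParse("<article><body>some text</body></article>", asText = TRUE)
without_body <- xmlParse("<article><title>no body here</title></article>", asText = TRUE)

# xpathSApply returns one value per matching node ...
found <- xpathSApply(with_body, "//body", xmlValue)
# ... and an empty list when nothing matches, instead of raising an error.
absent <- xpathSApply(without_body, "//body", xmlValue)

length(found)   # 1
length(absent)  # 0

# Indexing the document directly fails when the node is absent;
# tryCatch captures the error message instead of stopping the script.
result <- tryCatch(xmlValue(without_body[["//body"]]),
                   error = function(e) conditionMessage(e))
```

Checking length() of the result is therefore a safe way to detect a missing tag.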

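Then add a check to your code that handles the missing tag before building the data.frame (here the text is set to NA and a warning is printed) ...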

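library(XML)   # xmlParse, xpathSApply, xmlValue
library(plyr)  # ldply

dat <- ldply(seq_along(xmlfiles), function(i) {
  doc <- xmlParse(xmlfiles[i])
  # returns an empty list when the file has no <body> node
  teksti <- xpathSApply(doc, "//body", xmlValue)
  if (length(teksti) == 0) {
    print(paste("Warning: no body tag in", xmlfiles[i], i))
    teksti <- NA
  }
  # file name without its extension (text before the first ".")
  file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = TRUE))[1]
  data.frame(file, teksti)
})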
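
Frame 5 of the traceback in the question, FUN(X[[12L]], ...), suggests the failure happened on the 12th element, i.e. xmlfiles[12], which would fit with the first 11 files working. To list every affected file up front, something like the following might help. It is only a sketch: has_body is a hypothetical helper name, and it assumes that a file which fails to parse at all should also be reported.

```r
library(XML)

# Hypothetical helper: TRUE if the file parses and has at least one <body> node.
has_body <- function(path) {
  tryCatch({
    doc <- xmlParse(path)
    length(getNodeSet(doc, "//body")) > 0
  }, error = function(e) FALSE)  # unparseable files count as "no body"
}

xmlfiles <- list.files(pattern = "\\.xml$")
ok <- vapply(xmlfiles, has_body, logical(1))

# Files to inspect by hand
xmlfiles[!ok]
```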
