Question

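I have a file that contains multiple XML declarations. Using the approach from the post "Parseing XML by R always return XML declaration error", I was able to detect them and read the documents one by one. The data comes from: https://www.google.com/googlebooks/uspto-patents-applications-text.html.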

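### read xml document with more than one <?xml declaration in R

library(XML)  # provides xmlTreeParse, xmlRoot, xmlSApply, xmlValue

lines   <- readLines("pa020829.xml")
# each document starts at a line holding the XML declaration
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)
# each document ends on the line before the next declaration
end     <- c(start[-1] - 1, length(lines))

# parse the i-th document from its slice of lines
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  xmlTreeParse(txt, asText = TRUE)
}
docs <- lapply(1:10, get.xml)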

> class(docs)
[1] "list"
> class(docs[1])
[1] "list"
> class(docs[[1]])
[1] "XMLDocument"         "XMLAbstractDocument"

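The list docs contains 10 similar documents, referred to as docs[[1]], docs[[2]], and so on. I managed to extract the root of a single document and put its data into a matrix: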

root <- xmlRoot(docs[[1]])

d <- rbind(unlist(xmlSApply(root[[1]], function(x) xmlSApply(x, xmlValue))))

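However, I need code that automatically retrieves the data from all 10 documents and appends it to a single data frame. I tried the code below, but it only retrieves the data from the root of the first document and appends it to the matrix multiple times.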

d <- lapply(docs, function(x) rbind(unlist(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))))

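I think I need to change the way root is referenced inside the function.

Any ideas on how to create a matrix with the data from all of the documents?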

Answer 1

The following code returns a matrix containing the data from all of the documents:

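# flatten one document into a one-row matrix, taking the root from x itself
getXmlInternal <- function(x) {
  rbind(unlist(xmlSApply(xmlRoot(x), function(y) xmlSApply(y, xmlValue))))
}

# stack the rows; do.call passes each list element to rbind as its own argument
# (this requires every document to yield the same number of fields)
d <- do.call(rbind, lapply(docs, getXmlInternal))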

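This solves the xmlRoot problem you mentioned: xmlRoot is called on each document that lapply passes in, rather than on a single root computed beforehand. The result of lapply is passed to rbind so that the output is the requested matrix.

Defining the getXmlInternal function separately makes the answer more readable.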
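To see why the root has to be taken from the function argument, here is a minimal sketch on two tiny stand-in documents. The XML strings and the names txts, flatten_doc, same, and d are made up for illustration and are not part of the patent data. It assumes the XML package is installed and that every document has the same layout.

```r
library(XML)  # provides xmlTreeParse, xmlRoot, xmlSApply, xmlValue

# Hypothetical stand-ins for the real documents read from pa020829.xml
txts <- c(
  "<doc><head><id>1</id><title>First</title></head><body><abstract>A</abstract></body></doc>",
  "<doc><head><id>2</id><title>Second</title></head><body><abstract>B</abstract></body></doc>"
)
docs <- lapply(txts, function(txt) xmlTreeParse(txt, asText = TRUE))

# The question's pattern: root is fixed outside the function, so x is ignored
root <- xmlRoot(docs[[1]])
same <- lapply(docs, function(x) {
  rbind(unlist(xmlSApply(root, function(y) xmlSApply(y, xmlValue))))
})
identical(same[[1]], same[[2]])  # every element repeats document 1

# The fix: take the root from the document passed in as x
flatten_doc <- function(x) {
  rbind(unlist(xmlSApply(xmlRoot(x), function(y) xmlSApply(y, xmlValue))))
}

# One row per document; do.call spreads the list over rbind's arguments
d <- do.call(rbind, lapply(docs, flatten_doc))
dim(d)
```

If a data frame is needed rather than a character matrix, as.data.frame(d, stringsAsFactors = FALSE) should convert the result. Documents with differing numbers of fields would need to be padded or aligned by name before stacking.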
