Question

I am new to XML. I downloaded an XML file called ipg140722 from Google (http://www.google.com/googlebooks/uspto-patents-grants-text.html). I am using Windows 8.1 and R 3.1.1:

library(XML)
url<- "E:\\clouddownload\\R-download\\ipg140722.xml"
indata<- xmlTreeParse(url)

XML declaration allowed only at the start of the document
Extra content at the end of the document
error: 1: XML declaration allowed only at the start of the document
2: Extra content at the end of the document

What is the problem?

Answer 1

NOTE: This post has been edited from the original version.

The object lesson here is that just because a file has an xml extension does not mean it is well-formed XML.
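As a minimal sketch of why this matters, the snippet below builds two tiny documents (made-up content, not taken from the patent file), shows that one parses on its own, and then shows that the same two documents strung together into one string do not. It assumes the XML package's default behaviour of raising an R error when parsing fails, which matches the error shown in the question.

```r
library(XML)

# Two tiny, individually well-formed documents (hypothetical content).
doc1 <- '<?xml version="1.0" encoding="UTF-8"?>\n<patent><title>A</title></patent>'
doc2 <- '<?xml version="1.0" encoding="UTF-8"?>\n<patent><title>B</title></patent>'

# A single document parses without complaint.
ok1 <- xmlParse(doc1, asText = TRUE)

# Strung together, the second declaration is no longer at the start of the
# document and there are two root elements, so parsing fails.
both   <- paste(doc1, doc2, sep = "\n")
result <- tryCatch(xmlParse(both, asText = TRUE),
                   error = function(e) "parse failed")
result
```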

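If @MartinMorgan is correct about the file, Google seems to have taken all the patents granted during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that file an xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.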

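# Read the file and find the line on which each embedded document starts.
lines   <- readLines("ipg140722.xml")
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed=TRUE)
end     <- c(start[-1]-1, length(lines))
library(XML)
# Parse the i-th embedded document from its own block of lines.
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse="\n")
  # print(i)
  # useInternalNodes=TRUE returns an XMLInternalDocument, which XPath queries need
  xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)
  # return(i)
}
docs <- lapply(1:10, get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"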

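So now docs is a list of parsed XML documents. These can be accessed individually, e.g. docs[[1]], or collectively using something like the code below, which extracts the invention title from each document.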

sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
#  [1] "Phallus retention harness"                          "Dress/coat"                                        
#  [3] "Shirt"                                              "Shirt"                                             
#  [5] "Sandal"                                             "Shoe"                                              
#  [7] "Footwear"                                           "Flexible athletic shoe sole"                       
#  [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"

And no, I did not make up the name of the first patent.

Response to OP's comment

My original post, which detected the start of a new document using:

start   <- grep("xml version",lines,fixed=T)

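was too naive: it turns out that the phrase "xml version" appears in the text of some of the patents. So this was splitting (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem. If you un-comment the two lines in the function get.xml(...), and run the code above with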

docs <- lapply(1:length(start),get.xml)

you will see that all 6961 documents parse correctly.
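To illustrate the difference between the two patterns, here is a small sketch on a made-up set of lines (hypothetical content, not from the patent file). The loose pattern also matches a line of body text that merely mentions the phrase, while the full declaration matches only the real document starts. Matching the full declaration is still a heuristic: it would also be fooled by a patent that quotes the entire declaration in its text.

```r
# Hypothetical lines: two documents, the first of which mentions "xml version" in its text.
lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<doc>',
  '<p>The header contains an xml version attribute.</p>',
  '</doc>',
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<doc>',
  '</doc>'
)

# Loose pattern: also hits the body text on line 3.
naive  <- grep("xml version", lines, fixed = TRUE)

# Full declaration: hits only the two document starts.
strict <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)

naive   # 1 3 5
strict  # 1 5
```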

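But there is another problem: the parsed XML is very large, so if you leave these lines as comments and try to parse the full set, you run out of memory about halfway through (or I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say 2000 documents at a time). The second is to extract whatever information you need for your CSV file in get.xml(...) and discard the parsed document at each step.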
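The second approach could look roughly like the sketch below. It is only an illustration: the two-document input, the helper name get.title, and the output file are all made up here, and the invention title stands in for whatever fields you actually need. The point is that each parsed document lives only inside the function call, so only the extracted strings accumulate.

```r
library(XML)

# Tiny stand-in for the real file: two documents strung together (hypothetical content).
lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<us-patent-grant><invention-title>Shoe</invention-title></us-patent-grant>',
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<us-patent-grant><invention-title>Sandal</invention-title></us-patent-grant>'
)
start <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)
end   <- c(start[-1] - 1, length(lines))

# Parse one document, keep only its title, and let the parsed tree go out of scope.
get.title <- function(i) {
  txt  <- paste(lines[start[i]:end[i]], collapse = "\n")
  doc  <- xmlParse(txt, asText = TRUE)
  vals <- xpathSApply(doc, "//invention-title", xmlValue)
  # Return NA when a document has no title instead of failing.
  if (length(vals) > 0) vals[[1]] else NA_character_
}

titles <- vapply(seq_along(start), get.title, character(1))

# Write the extracted fields to a CSV file (a temporary file in this sketch).
out <- tempfile(fileext = ".csv")
write.csv(data.frame(title = titles), out, row.names = FALSE)
```

If memory use still grows on the full file, the XML package also provides free() for releasing a parsed document explicitly; whether that is needed here is something to check on your own system.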

Parsing XML with R always returns an XML declaration error
