Question

I am trying to scrape XML data in R and am running into the error below. XML link: http://data.gov.in/sites/default/files/Arecanut(Betelnut_Supari)_2005.xml Code:

library(RCurl);
library(XML)
test <- readHTMLTable(doc="http://data.gov.in/sites/default/files/Arecanut(Betelnut_Supari)_2005.xml")

Error:

Error in UseMethod("xmlNamespaceDefinitions") : 
  no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

Answer 1

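The file isn't "simple" to get and, since it is 3 MB in size, it is best to download it first and then process it. Next, you aren't parsing HTML: you are reading an XML SOAP response, so readHTMLTable wouldn't get you far even without the namespace issue. The records you are trying to extract look like this: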

<diffgr:diffgram>
  <NewDataSet>
    <Table diffgr:id="Table1" msdata:rowOrder="0">
      <State>Assam</State>
      <District>Barpeta</District>
      <Market>Howly</Market>
      <Commodity>Arecanut(Betelnut/Supari)</Commodity>
      <Variety>Other</Variety>
      <Arrival_Date>18/06/2005</Arrival_Date>
      <Min_x0020_Price>5000</Min_x0020_Price>
      <Max_x0020_Price>8000</Max_x0020_Price>
      <Modal_x0020_Price>6500</Modal_x0020_Price>
    </Table>
    …

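Since the diffgram node has a namespace (diffgr), you need to select the nodes using that namespace and then convert each node's children into a data frame row. Further explanation is inline with the solution: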

library(XML)
library(data.table)

# be kind to data providers + have the file in case the Internet is down or they
# move the file and, finally, speed up processing later on by having it local
#
# download.file("http://data.gov.in/sites/default/files/Arecanut(Betelnut_Supari)_2005.xml", 
#                destfile="arecanut_2005.xml")

dat <- xmlTreeParse("arecanut_2005.xml", useInternalNodes=TRUE)

# There are namespaces in the XML file, so we need to extract them
nsDefs <- xmlNamespaces(dat, recursive=TRUE)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

# the "table" nodes use diffgr:diffgram, so we need to make sure we include
# the diffgr namespace which is ns[9]

nodes <- getNodeSet(dat ,"//diffgr:diffgram/NewDataSet/Table", ns[9])

# we then loop through the nodes, converting each set of values to a 
# data frame then using data.table's rbindlist with `fill=TRUE` just in
# case some records have greater or fewer fields.

tmp <- rbindlist(lapply(nodes, function(x) {
   as.data.frame.list(xmlApply(x, xmlValue))
}), fill=TRUE)

str(tmp)

## Classes ‘data.table’ and 'data.frame':  8127 obs. of  9 variables:
##  $ State            : Factor w/ 9 levels "Assam","Goa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ District         : Factor w/ 30 levels "Barpeta","Darrang",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Market           : Factor w/ 50 levels "Howly","Kharupetia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Commodity        : Factor w/ 1 level "Arecanut(Betelnut/Supari)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Variety          : Factor w/ 26 levels "Other","Supari",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Arrival_Date     : Factor w/ 323 levels "18/06/2005","19/06/2005",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Min_x0020_Price  : Factor w/ 1923 levels "5000","1000",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ Max_x0020_Price  : Factor w/ 2394 levels "8000","1250",..: 1 1 1 1 1 1 2 3 3 3 ...
##  $ Modal_x0020_Price: Factor w/ 2385 levels "6500","1100",..: 1 1 1 1 1 1 2 2 2 3 ...
##  - attr(*, ".internal.selfref")=<externalptr> 

head(tmp)

##     State District Market                 Commodity Variety Arrival_Date Min_x0020_Price Max_x0020_Price Modal_x0020_Price
## 1: Assam  Barpeta  Howly Arecanut(Betelnut/Supari)   Other   18/06/2005            5000            8000              6500
## 2: Assam  Barpeta  Howly Arecanut(Betelnut/Supari)   Other   19/06/2005            5000            8000              6500
## 3: Assam  Barpeta  Howly Arecanut(Betelnut/Supari)   Other   20/06/2005            5000            8000              6500
## 4: Assam  Barpeta  Howly Arecanut(Betelnut/Supari)   Other   21/06/2005            5000            8000              6500
## 5: Assam  Barpeta  Howly Arecanut(Betelnut/Supari)   Other   22/06/2005            5000            8000              6500
## 6: Assam  Barpeta  Howly Arecanut(Betelnut/Supari)   Other   24/06/2005            5000            8000              6500
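The `str()` output above shows that every column came back as a factor and that the price columns kept the `_x0020_` encoding of the space in their element names. What follows is a minimal cleanup sketch, not part of the original answer. It assumes prices are plain numbers and dates are in `dd/mm/yyyy` form, as in the sample rows. `tmp_small` and `clean_prices` are hypothetical names, and the toy values only mimic the shape of `tmp`.

```r
# Toy stand-in for `tmp`: illustrative values shaped like the output above
tmp_small <- data.frame(
  Arrival_Date      = c("18/06/2005", "19/06/2005", "20/06/2005"),
  Min_x0020_Price   = c("5000", "5000", "1000"),
  Modal_x0020_Price = c("6500", "6500", "1100"),
  stringsAsFactors  = TRUE
)

# Hypothetical helper: tidy column names and convert column types
clean_prices <- function(df) {
  df <- as.data.frame(df)

  # "_x0020_" encodes a space in the element name; swap it for "_"
  names(df) <- gsub("_x0020_", "_", names(df), fixed = TRUE)

  # Convert via character first: as.numeric() on a factor returns the
  # internal level codes, not the printed values
  price_cols <- grep("_Price$", names(df), value = TRUE)
  for (col in price_cols) {
    df[[col]] <- as.numeric(as.character(df[[col]]))
  }

  # Assumes dd/mm/yyyy, as in the sample records
  df$Arrival_Date <- as.Date(as.character(df$Arrival_Date), format = "%d/%m/%Y")
  df
}

cleaned <- clean_prices(tmp_small)
str(cleaned)
```

Since R 4.0.0 `as.data.frame()` no longer creates factors by default, so the columns may already be character; the helper handles both cases because it always goes through `as.character()`.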

If you are going to keep working with these kinds of files, it is in your best interest to read up a bit on XML processing in R.
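To experiment without the 3 MB download, the namespace handling can be reproduced on a small inline document. This is a sketch under stated assumptions: the XML below is a hypothetical miniature of the file, the namespace URIs are placeholders rather than the ones in the real data, and the second record deliberately lacks a field to exercise `fill=TRUE`. It also looks the namespace up by its prefix name, `ns["diffgr"]`, rather than by position, which should keep working if the order of the declarations changes.

```r
library(XML)
library(data.table)

# Hypothetical miniature of the file; the namespace URIs are placeholders
xml_text <- '<root xmlns:diffgr="urn:example:diffgram"
      xmlns:msdata="urn:example:msdata">
  <diffgr:diffgram>
    <NewDataSet>
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <State>Assam</State>
        <Market>Howly</Market>
        <Modal_x0020_Price>6500</Modal_x0020_Price>
      </Table>
      <Table diffgr:id="Table2" msdata:rowOrder="1">
        <State>Assam</State>
        <Market>Howly</Market>
      </Table>
    </NewDataSet>
  </diffgr:diffgram>
</root>'

doc <- xmlParse(xml_text, asText = TRUE)

# Same extraction as in the answer: a named vector of prefix -> URI
ns_defs <- xmlNamespaces(doc, recursive = TRUE)
ns <- structure(sapply(ns_defs, function(x) x$uri), names = names(ns_defs))

# Pick the namespace by prefix name instead of by index
nodes <- getNodeSet(doc, "//diffgr:diffgram/NewDataSet/Table", ns["diffgr"])

rows <- rbindlist(lapply(nodes, function(node) {
  fields <- getNodeSet(node, "./*")   # element children of this record
  values <- lapply(fields, xmlValue)
  names(values) <- vapply(fields, xmlName, character(1))
  as.data.frame(values, stringsAsFactors = FALSE)
}), fill = TRUE)

print(rows)
```

The record with the missing field gets `NA` in that column, which is the case the answer's `fill=TRUE` is there to cover.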
