I'm trying to scrape XML data in R and am running into the following error.
XML link: http://data.gov.in/sites/default/files/Arecanut(Betelnut_Supari)_2005.xml
Code:
library(RCurl);
library(XML)
test <- readHTMLTable(doc="http://data.gov.in/sites/default/files/Arecanut(Betelnut_Supari)_2005.xml")
Error:
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
Answer 0 (score: 1)
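The file isn't "simple" to get and, since it's 3MB in size, it's better to download it first and process it afterwards. Next, you're not parsing HTML; you're reading an XML SOAP response, so even without the namespace issue you wouldn't have had much luck with readHTMLTable. The records you're trying to extract look like this: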
<diffgr:diffgram>
<NewDataSet>
<Table diffgr:id="Table1" msdata:rowOrder="0">
<State>Assam</State>
<District>Barpeta</District>
<Market>Howly</Market>
<Commodity>Arecanut(Betelnut/Supari)</Commodity>
<Variety>Other</Variety>
<Arrival_Date>18/06/2005</Arrival_Date>
<Min_x0020_Price>5000</Min_x0020_Price>
<Max_x0020_Price>8000</Max_x0020_Price>
<Modal_x0020_Price>6500</Modal_x0020_Price>
</Table>
…
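Since the diffgram node has a namespace (diffgr), you need to extract the nodes using that namespace and then convert each node's children into a data frame row. Further explanation is inline with the solution: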
library(XML)
library(data.table)
# be kind to data providers + have the file in case the Internet is down or they
# move the file and, finally, speed up processing later on by having it local
#
# download.file("http://data.gov.in/sites/default/files/Arecanut(Betelnut_Supari)_2005.xml",
# destfile="arecanut_2005.xml")
dat <- xmlTreeParse("arecanut_2005.xml", useInternalNodes=TRUE)
# There are namespaces in the XML file, so we need to extract them
nsDefs <- xmlNamespaces(dat, recursive=TRUE)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
# the "table" nodes use diffgr:diffgram, so we need to make sure we include
# the diffgr namespace which is ns[9]
nodes <- getNodeSet(dat ,"//diffgr:diffgram/NewDataSet/Table", ns[9])
# we then loop through the nodes, converting each set of values to a
# data frame then using data.table's rbindlist with `fill=TRUE` just in
# case some records have greater or fewer fields.
tmp <- rbindlist(lapply(nodes, function(x) {
as.data.frame.list(xmlApply(x, xmlValue))
}), fill=TRUE)
str(tmp)
## Classes ‘data.table’ and 'data.frame': 8127 obs. of 9 variables:
## $ State : Factor w/ 9 levels "Assam","Goa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ District : Factor w/ 30 levels "Barpeta","Darrang",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Market : Factor w/ 50 levels "Howly","Kharupetia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Commodity : Factor w/ 1 level "Arecanut(Betelnut/Supari)": 1 1 1 1 1 1 1 1 1 1 ...
## $ Variety : Factor w/ 26 levels "Other","Supari",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Arrival_Date : Factor w/ 323 levels "18/06/2005","19/06/2005",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Min_x0020_Price : Factor w/ 1923 levels "5000","1000",..: 1 1 1 1 1 1 2 2 2 2 ...
## $ Max_x0020_Price : Factor w/ 2394 levels "8000","1250",..: 1 1 1 1 1 1 2 3 3 3 ...
## $ Modal_x0020_Price: Factor w/ 2385 levels "6500","1100",..: 1 1 1 1 1 1 2 2 2 3 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(tmp)
## State District Market Commodity Variety Arrival_Date Min_x0020_Price Max_x0020_Price Modal_x0020_Price
## 1: Assam Barpeta Howly Arecanut(Betelnut/Supari) Other 18/06/2005 5000 8000 6500
## 2: Assam Barpeta Howly Arecanut(Betelnut/Supari) Other 19/06/2005 5000 8000 6500
## 3: Assam Barpeta Howly Arecanut(Betelnut/Supari) Other 20/06/2005 5000 8000 6500
## 4: Assam Barpeta Howly Arecanut(Betelnut/Supari) Other 21/06/2005 5000 8000 6500
## 5: Assam Barpeta Howly Arecanut(Betelnut/Supari) Other 22/06/2005 5000 8000 6500
## 6: Assam Barpeta Howly Arecanut(Betelnut/Supari) Other 24/06/2005 5000 8000 6500
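Note that str(tmp) reports every column as a Factor, including the prices and dates. A minimal sketch of a cleanup step, assuming you want numeric prices and Date values; tmp_sample is a hypothetical two-row stand-in built from the rows printed above, and the renamed columns (Min_Price and so on) are my own choice, not something the file dictates:

```r
# Hypothetical stand-in for `tmp`, using values from the head(tmp) output
tmp_sample <- data.frame(
  State             = c("Assam", "Assam"),
  Arrival_Date      = c("18/06/2005", "19/06/2005"),
  Min_x0020_Price   = c("5000", "5000"),
  Max_x0020_Price   = c("8000", "8000"),
  Modal_x0020_Price = c("6500", "6500"),
  stringsAsFactors  = TRUE  # mimic the all-factor columns shown by str(tmp)
)

# "_x0020_" is how the space in the element names was encoded in the XML
names(tmp_sample) <- gsub("_x0020_", "_", names(tmp_sample), fixed = TRUE)

# Go through as.character() first: as.numeric() on a factor returns level codes
price_cols <- grep("_Price$", names(tmp_sample), value = TRUE)
for (col in price_cols) {
  tmp_sample[[col]] <- as.numeric(as.character(tmp_sample[[col]]))
}

# Dates in the file are day/month/year
tmp_sample$Arrival_Date <- as.Date(as.character(tmp_sample$Arrival_Date),
                                   format = "%d/%m/%Y")

str(tmp_sample)
```

The same steps should apply to tmp itself, since a data.table is also a data.frame.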
If you're going to keep working with these types of files, it's in your best interest to read up a bit on XML processing in R.
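As a small starting point, here is a self-contained sketch of the namespace handling on an inline sample, so you can experiment without the 3MB file. It assumes the diffgr prefix maps to urn:schemas-microsoft-com:xml-diffgram-v1 (check the xmlns:diffgr declaration in your own file); the root wrapper element, the d prefix, and the variable names are made up for illustration:

```r
library(XML)

# Tiny stand-in for the real response; the <root> wrapper is hypothetical
xml_text <- '<root xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
<diffgr:diffgram>
<NewDataSet>
<Table diffgr:id="Table1"><State>Assam</State><Arrival_Date>18/06/2005</Arrival_Date></Table>
<Table diffgr:id="Table2"><State>Assam</State><Arrival_Date>19/06/2005</Arrival_Date></Table>
</NewDataSet>
</diffgr:diffgram>
</root>'

doc <- xmlParse(xml_text, asText = TRUE)

# Pass the namespace by URI under a prefix of your choosing, instead of
# relying on its position in the list of definitions (ns[9] above)
ns <- c(d = "urn:schemas-microsoft-com:xml-diffgram-v1")
nodes <- getNodeSet(doc, "//d:diffgram/NewDataSet/Table", namespaces = ns)

# One data frame row per <Table> node, columns named after the child elements
rows <- lapply(nodes, function(node) {
  as.data.frame(as.list(xmlSApply(node, xmlValue)), stringsAsFactors = FALSE)
})
sample_df <- do.call(rbind, rows)
print(sample_df)
```

The prefix in the XPath only has to match the name you give in the namespaces vector; what the XML package compares against the document is the URI.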