I need to extract data from a large XML file in R. The file is about 60 MB. I use the following R code to download the data from the Internet:
library(XML)
library(httr)
url = "http://hydro1.sci.gsfc.nasa.gov/daac-bin/his/1.0/NLDAS_NOAH_002.cgi"
SOAPAction = "http://www.cuahsi.org/his/1.0/ws/GetSites"
envelope = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap:Envelope xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n<soap:Body>\n<GetSites xmlns=\"http://www.cuahsi.org/his/1.0/ws/\">\n<site></site><authToken></authToken>\n</GetSites>\n</soap:Body>\n</soap:Envelope>"
response = POST(url, body = envelope,
                add_headers("Content-Type" = "text/xml", "SOAPAction" = SOAPAction))
status.code = http_status(response)$category
After receiving the response from the server, I use the following code to parse the data into a data.frame:
# Parse the XML into a tree
WaterML = content(response, as="text")
SOAPdoc = xmlRoot(xmlTreeParse(WaterML, getDTD=FALSE, useInternalNodes = TRUE))
doc = SOAPdoc[[1]][[1]][[1]]
# Allocate a new empty data frame with the same number of rows as the number of sites
N = xmlSize(doc) - 1
df = data.frame(SiteName=rep("",N),
                SiteID=rep(NA, N),
                SiteCode=rep("",N),
                Latitude=rep(NA,N),
                Longitude=rep(NA,N),
                stringsAsFactors=FALSE)
# Populate the data frame with the values
# This loop is VERY SLOW it takes around 10 MINUTES!
start.time = Sys.time()
for(i in 1:N){
  siteInfo = doc[[i+1]][[1]]
  siteList = xmlToList(siteInfo)
  siteName = siteList$siteName
  sCode = siteList$siteCode
  siteCode = sCode$text
  siteID = ifelse(is.null(sCode$.attrs["siteID"]), siteCode, sCode$.attrs["siteID"])
  latitude = as.numeric(siteList$geoLocation$geogLocation$latitude)
  longitude = as.numeric(siteList$geoLocation$geogLocation$longitude)
  # Store the extracted values in row i of the data frame
  df[i, ] = list(siteName, siteID, siteCode, latitude, longitude)
}
end.time = Sys.time()
time.taken = end.time - start.time
time.taken
The for loop I use to parse the XML into a data.frame is very slow. It takes about 10 minutes to complete. Is there any way to make the loop faster?
Answer 0 (score: 3)
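I was able to get better performance by using XPath expressions to extract the required information. Each call to xpathSApply takes about 20 seconds on my laptop, so all of the commands finish in under 2 minutes.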
# you need to specify the namespace information
ns <- c(soap="http://schemas.xmlsoap.org/soap/envelope/",
        xsd="http://www.w3.org/2001/XMLSchema",
        xsi="http://www.w3.org/2001/XMLSchema-instance",
        sr="http://www.cuahsi.org/waterML/1.0/",
        gsr="http://www.cuahsi.org/his/1.0/ws/")
Data <- list(
  siteName = xpathSApply(SOAPdoc, "//sr:siteName", xmlValue, namespaces=ns),
  siteCode = xpathSApply(SOAPdoc, "//sr:siteCode", xmlValue, namespaces=ns),
  siteID = xpathSApply(SOAPdoc, "//sr:siteCode", xmlGetAttr, name="siteID", namespaces=ns),
  latitude = xpathSApply(SOAPdoc, "//sr:latitude", xmlValue, namespaces=ns),
  longitude = xpathSApply(SOAPdoc, "//sr:longitude", xmlValue, namespaces=ns))
DataFrame <- as.data.frame(Data, stringsAsFactors=FALSE)
DataFrame$latitude <- as.numeric(DataFrame$latitude)
DataFrame$longitude <- as.numeric(DataFrame$longitude)
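To try the XPath approach without downloading the 60 MB response, here is a minimal sketch on a tiny inline document. The XML below is made-up stand-in data: its element names follow the XPath queries above, but its layout is only an assumption about what the real service returns. The names xml_text, toy_doc and toy are hypothetical. The sketch also passes default = NA_character_ through to xmlGetAttr, which is my addition and not part of the code above, so that a site without a siteID attribute yields NA instead of NULL and the result stays a plain vector.

```r
library(XML)

# Made-up stand-in for the server response (assumed structure, two sites;
# the second site deliberately has no siteID attribute)
xml_text <- '<sitesResponse xmlns="http://www.cuahsi.org/waterML/1.0/">
  <site><siteInfo>
    <siteName>Site A</siteName>
    <siteCode siteID="1">A001</siteCode>
    <geoLocation><geogLocation>
      <latitude>25.0625</latitude><longitude>-124.9375</longitude>
    </geogLocation></geoLocation>
  </siteInfo></site>
  <site><siteInfo>
    <siteName>Site B</siteName>
    <siteCode>B002</siteCode>
    <geoLocation><geogLocation>
      <latitude>25.1875</latitude><longitude>-124.8125</longitude>
    </geogLocation></geoLocation>
  </siteInfo></site>
</sitesResponse>'

# Parse into an internal (C-level) document, which XPath queries require
toy_doc <- xmlParse(xml_text, asText = TRUE)

# Map a prefix of our choosing to the document's default namespace
ns <- c(sr = "http://www.cuahsi.org/waterML/1.0/")

# One vectorised XPath query per column instead of a per-site loop
toy <- data.frame(
  siteName  = xpathSApply(toy_doc, "//sr:siteName", xmlValue, namespaces = ns),
  siteCode  = xpathSApply(toy_doc, "//sr:siteCode", xmlValue, namespaces = ns),
  siteID    = xpathSApply(toy_doc, "//sr:siteCode", xmlGetAttr, name = "siteID",
                          default = NA_character_, namespaces = ns),
  latitude  = as.numeric(xpathSApply(toy_doc, "//sr:latitude", xmlValue, namespaces = ns)),
  longitude = as.numeric(xpathSApply(toy_doc, "//sr:longitude", xmlValue, namespaces = ns)),
  stringsAsFactors = FALSE)

print(toy)
```

If you want the behaviour of the original loop, where the site code is used when siteID is missing, something like toy$siteID[is.na(toy$siteID)] <- toy$siteCode[is.na(toy$siteID)] should do it.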