Parsing large XML files in R is very slow

Time: 2015-06-10 23:45:02

Tags: xml r performance xml-parsing dataframe

I need to extract data from a large XML file in R. The file is 60 MB. I use the following R code to download the data from the Internet:

library(XML)
library(httr)

url = "http://hydro1.sci.gsfc.nasa.gov/daac-bin/his/1.0/NLDAS_NOAH_002.cgi"
SOAPAction = "http://www.cuahsi.org/his/1.0/ws/GetSites"
envelope = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap:Envelope xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n<soap:Body>\n<GetSites xmlns=\"http://www.cuahsi.org/his/1.0/ws/\">\n<site></site><authToken></authToken>\n</GetSites>\n</soap:Body>\n</soap:Envelope>"

response = POST(url, body = envelope,
             add_headers("Content-Type" = "text/xml", "SOAPAction" = SOAPAction))
status.code = http_status(response)$category

After receiving the server's response, I use the following code to parse the data into a data.frame:

# Parse the XML into a tree
WaterML = content(response, as="text")
SOAPdoc = xmlRoot(xmlTreeParse(WaterML, getDTD=FALSE, useInternalNodes = TRUE))
doc = SOAPdoc[[1]][[1]][[1]]

# Allocate a new empty data frame with as many rows as there are sites
N = xmlSize(doc) - 1
df = data.frame(SiteName=rep("",N),
             SiteID=rep(NA, N),
             SiteCode=rep("",N),
             Latitude=rep(NA,N),
             Longitude=rep(NA,N),
             stringsAsFactors=FALSE)

# Populate the data frame with the values
# This loop is VERY SLOW it takes around 10 MINUTES!
start.time = Sys.time()

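for(i in 1:N){
  siteInfo = doc[[i+1]][[1]]
  siteList = xmlToList(siteInfo)
  siteName = siteList$siteName
  sCode = siteList$siteCode
  siteCode = sCode$text
  # Fall back to the site code when there is no siteID attribute
  siteID = ifelse(is.null(sCode$.attrs["siteID"]), siteCode, sCode$.attrs["siteID"])
  latitude = as.numeric(siteList$geoLocation$geogLocation$latitude)
  longitude = as.numeric(siteList$geoLocation$geogLocation$longitude)
  # Store the extracted values in row i of the data frame
  df$SiteName[i] = siteName
  df$SiteID[i] = siteID
  df$SiteCode[i] = siteCode
  df$Latitude[i] = latitude
  df$Longitude[i] = longitude
}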

end.time = Sys.time()
time.taken = end.time - start.time
time.taken

The for loop I use to parse the XML into a data.frame is very slow. It takes about 10 minutes to finish. Is there a way to make the loop faster?

1 Answer:

Answer 0 (score: 3)

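I was able to get better performance by using XPath expressions to extract the required information. Each call to xpathSApply takes about 20 seconds on my laptop, so all of the commands together finish in under 2 minutes.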

# you need to specify the namespace information
ns <- c(soap="http://schemas.xmlsoap.org/soap/envelope/",
        xsd="http://www.w3.org/2001/XMLSchema",
        xsi="http://www.w3.org/2001/XMLSchema-instance",
        sr="http://www.cuahsi.org/waterML/1.0/",
        gsr="http://www.cuahsi.org/his/1.0/ws/")

Data <- list(
  siteName = xpathSApply(SOAPdoc, "//sr:siteName", xmlValue, namespaces=ns),
  siteCode = xpathSApply(SOAPdoc, "//sr:siteCode", xmlValue, namespaces=ns),
  siteID = xpathSApply(SOAPdoc, "//sr:siteCode", xmlGetAttr, name="siteID", namespaces=ns),
  latitude = xpathSApply(SOAPdoc, "//sr:latitude", xmlValue, namespaces=ns),
  longitude = xpathSApply(SOAPdoc, "//sr:longitude", xmlValue, namespaces=ns))
DataFrame <- as.data.frame(Data, stringsAsFactors=FALSE)
DataFrame$latitude <- as.numeric(DataFrame$latitude)
DataFrame$longitude <- as.numeric(DataFrame$longitude)
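
To try the namespace-aware XPath approach without downloading the 60 MB response, here is a minimal sketch on a tiny hand-written document. The XML content, the names toy_doc, toy_ns, toy_df and the helper get_text are all made up for illustration; only the element names and the WaterML namespace URI come from the code above. It also passes default = NA_character_ to xmlGetAttr, on the assumption that some siteCode elements may lack a siteID attribute; without a default, xpathSApply would probably return a list containing NULLs rather than a plain vector.

```r
library(XML)

# Tiny stand-in for the real response (hypothetical data);
# the second site deliberately has no siteID attribute.
toy_xml <- '<sitesResponse xmlns="http://www.cuahsi.org/waterML/1.0/">
  <site><siteInfo>
    <siteName>Site A</siteName>
    <siteCode siteID="1">A001</siteCode>
    <geoLocation><geogLocation>
      <latitude>25.5</latitude><longitude>-80.25</longitude>
    </geogLocation></geoLocation>
  </siteInfo></site>
  <site><siteInfo>
    <siteName>Site B</siteName>
    <siteCode>B002</siteCode>
    <geoLocation><geogLocation>
      <latitude>26.0</latitude><longitude>-81.5</longitude>
    </geogLocation></geoLocation>
  </siteInfo></site>
</sitesResponse>'

toy_doc <- xmlParse(toy_xml, asText = TRUE)

# The document uses a default namespace, so XPath needs an explicit prefix for it
toy_ns <- c(sr = "http://www.cuahsi.org/waterML/1.0/")

# Helper (hypothetical name): text of every node matching an XPath expression
get_text <- function(path) {
  xpathSApply(toy_doc, path, xmlValue, namespaces = toy_ns)
}

toy_df <- data.frame(
  siteName  = get_text("//sr:siteName"),
  siteCode  = get_text("//sr:siteCode"),
  # default = NA_character_ keeps one value per node when the attribute is missing
  siteID    = xpathSApply(toy_doc, "//sr:siteCode", xmlGetAttr, name = "siteID",
                          default = NA_character_, namespaces = toy_ns),
  latitude  = as.numeric(get_text("//sr:latitude")),
  longitude = as.numeric(get_text("//sr:longitude")),
  stringsAsFactors = FALSE)

print(toy_df)
```

Note that this style builds each column independently, so it relies on every site having exactly one of each element; if an element can be missing for some sites, the columns would no longer line up row by row.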