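I am trying to read all XML attributes that match certain patterns from my XML file (a sample of the file is shown below). The actual XML file is about 400 MB and has roughly 4.5 million lines of XML nodes and attributes.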
<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="10800.0" type="actend" person="9982471" link="21225" actType="home" />
<event time="10800.0" type="departure" person="9982471" link="21225" legMode="car" />
<event time="10800.0" type="PersonEntersVehicle" person="9982471" vehicle="9982471" />
<event time="10800.0" type="actend" person="9656271" link="21066" actType="home" />
<event time="10800.0" type="departure" person="9656271" link="21066" legMode="car" />
<event time="10800.0" type="PersonEntersVehicle" person="9656271" vehicle="9656271" />
<event time="99489.0" type="entered link" person="10777221" link="14182" vehicle="10777221" />
<event time="99498.0" type="left link" person="10777221" link="14182" vehicle="10777221" />
<event time="99498.0" type="entered link" person="10777221" link="14128" vehicle="10777221" />
<event time="99533.0" type="left link" person="10777221" link="14128" vehicle="10777221" />
<event time="99533.0" type="entered link" person="10777221" link="14122" vehicle="10777221" />
<event time="99542.0" type="left link" person="10777221" link="14122" vehicle="10777221" />
<event time="99542.0" type="entered link" person="10777221" link="14100" vehicle="10777221" />
</events>
This is the code I use to extract the data frame I am interested in.
library(XML)
file <- "C:/Users/S/Desktop/100.events.test.xml"
popact <- xmlParse(file)
eventsdf <- sapply(c("time","type", "person", "link", "vehicle"), function(x) xpathSApply(popact, "//event[@type='left link']|//event[@type='entered link']", xmlGetAttr, x))
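Here are the problems I am facing. Using
"//event[@type='left link']|//event[@type='entered link']"
and using "//event"
(i.e. reading all attributes without any specific selection), I get the results in about half an hour. How can I reduce the run time of my code? Should I use a different approach to get the results I need?

Answer 0 (score: 2)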
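sapply is only needed if some nodes are missing attributes. If none are, as in the example, we can simplify the code to the following, where xpath2 is your XPath expression rewritten with or. In addition, this XPath expression traverses the node tree only once, because it contains only one //.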
xpath2 <- "//event[@type='left link' or @type='entered link']"
t(xpathSApply(popact, xpath2, xmlAttrs))
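To make the shape of the result concrete, here is a minimal sketch of the same call on a small inline copy of the sample. The names xml_text, doc and attr_matrix are only for illustration, and the conversion to a data frame with a numeric time column is my assumption about what you want at the end, not something your code already does.

```r
library(XML)

# Small inline sample (hypothetical cut-down copy of the file in the question)
xml_text <- '<events version="1.0">
<event time="10800.0" type="actend" person="9982471" link="21225" actType="home" />
<event time="99489.0" type="entered link" person="10777221" link="14182" vehicle="10777221" />
<event time="99498.0" type="left link" person="10777221" link="14182" vehicle="10777221" />
<event time="99498.0" type="entered link" person="10777221" link="14128" vehicle="10777221" />
</events>'
doc <- xmlParse(xml_text, asText = TRUE)

# One XPath evaluation; xmlAttrs returns all attributes of a node at once
xpath2 <- "//event[@type='left link' or @type='entered link']"
attr_matrix <- t(xpathSApply(doc, xpath2, xmlAttrs))

# Character matrix with one row per matching event; convert if a data frame is needed
eventsdf <- as.data.frame(attr_matrix, stringsAsFactors = FALSE)
eventsdf$time <- as.numeric(eventsdf$time)
```

This only works as a matrix because every matching node carries the same five attributes; if the attribute sets differ, xpathSApply returns a list instead.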
Here is a timing comparison:
library(rbenchmark)
xpath <- "//event[@type='left link']|//event[@type='entered link']"
benchmark(orig = sapply(c("time", "type", "person", "link", "vehicle"),
                        function(x) xpathSApply(popact, xpath, xmlGetAttr, x)),
          new = t(xpathSApply(popact, xpath2, xmlAttrs)))[1:4]
which gives:
  test replications elapsed relative
2  new          100    0.07    1.000
1 orig          100    0.68    9.714
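For the "//event" case, where the nodes do not all share the same attributes, the per-attribute approach is still needed. A rough sketch of one way to do it, assuming you want NA for absent attributes: evaluate the XPath once with getNodeSet and pass default = NA_character_ to xmlGetAttr so every column has the same length. The names xml_text, doc, wanted, nodes, cols and all_events are hypothetical, and I have not timed this on a 400 MB file.

```r
library(XML)

# Inline sample with mixed attribute sets (hypothetical cut-down copy of the file)
xml_text <- '<events version="1.0">
<event time="10800.0" type="actend" person="9982471" link="21225" actType="home" />
<event time="10800.0" type="PersonEntersVehicle" person="9982471" vehicle="9982471" />
<event time="99489.0" type="entered link" person="10777221" link="14182" vehicle="10777221" />
</events>'
doc <- xmlParse(xml_text, asText = TRUE)

wanted <- c("time", "type", "person", "link", "vehicle")

# Evaluate the XPath once and reuse the node set for every attribute
nodes <- getNodeSet(doc, "//event")

# xmlGetAttr returns `default` when the attribute is absent,
# so each column is a plain character vector of equal length
cols <- lapply(wanted, function(a) {
  sapply(nodes, xmlGetAttr, name = a, default = NA_character_)
})
names(cols) <- wanted
all_events <- as.data.frame(cols, stringsAsFactors = FALSE)
```

Here the actend row gets NA for vehicle and the PersonEntersVehicle row gets NA for link, instead of the result collapsing into a ragged list.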