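I'm trying to convert an XML file into a data frame, but the format comes out wrong. I've looked at different tutorials, and although I've had some success using for loops and navigating the parsed file to get the information I need, I've been told that this solution isn't efficient.
Then I tried this code: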
require(XML)
parsed<-xmlParse("SEWL.xml")
xmlToDataFrame(parsed)
But it gives an error:
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", :
  duplicate subscripts for columns
This other code works, but the format isn't what I need:
require(XML)
require(plyr)
pldf<-ldply(xmlToList("SEWL.xml"),data.frame)
The resulting data frame looks like this:
.id X..i.. text .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1
1 technician "John" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 location "CO" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 temp <NA> 21.3 celsius <NA> <NA> <NA> <NA> <NA>
4 runtype "routine" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 sample <NA> <NA> 2323 "LL18179" "2016/08" 0.3248 baseline 5.4389
6 sample <NA> <NA> 2323 "LL18179" "2016/08" 0.3248 baseline 5.4389
7 sample <NA> <NA> 8979237 "AA09453" "2016/03" 0.0117 baseline 5.6012
8 sample <NA> <NA> 8979237 "AA09453" "2016/03" 0.0117 baseline 5.6012
9 .attrs 2015_07_31_11_33_22 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 .attrs 20150731 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 .attrs 113322 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1
1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 std 6.5980 data 1.2131 "OK" laslum "ATR150607" "2017/05"
6 std 6.5980 data 1.2131 "OK" 3 "ATR150607" "2017/05"
7 std 1.1431 data 0.2041 "FAIL" absat <NA> <NA>
8 std 1.1431 data 0.2041 "FAIL" 2 <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5
1 <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA>
5 0.0673 baseline 4.9721 std 10.3851 data
6 0.0673 baseline 4.9721 std 10.3851 data
7 <NA> <NA> <NA> <NA> <NA> <NA>
8 <NA> <NA> <NA> <NA> <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA> <NA> <NA>
test.calc.1 test.result.1 test..attrs.1
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 2.0886 "Warning" atr
6 2.0886 "Warning" 1
7 <NA> <NA> <NA>
8 <NA> <NA> <NA>
9 <NA> <NA> <NA>
10 <NA> <NA> <NA>
11 <NA> <NA> <NA>
Here is the sample XML file I'm using:
<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>
The data frame I'd like to get:
experiment technician location temp runtype sample test order code validuntil baseline std data calc result date time
1 abc123 John CO 21.3 routine 2323 laslum 3 LL18179 2016/08 0.3248 5.4389 6.5980 1.2131 OK 20150731 113322
2 abc123 John CO 21.3 routine 2323 atr 1 ATR150607 2017/05 0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322
3 abc123 John CO 21.3 routine 8979237 absat 2 AA09453 2016/03 0.0117 5.6012 1.1431 0.2041 FAIL 20150731 113322
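I don't need exactly this format, just something close enough that I can convert it into the example.
Answer 0 (score: 6)
We present two ways to parse the XML. The first (three nested iterations over the experiment/sample/test nodes) may run faster, but the second (a single loop over the test nodes, walking back up the tree from each test node to get its ancestors) has simpler code.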
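As background on the error in the question: a likely cause is that xmlToDataFrame uses the names of each record's child nodes as column names, and the first sample node has two children that are both named test. The sketch below only inspects those names on a cut-down document; xml_text and child_names are illustrative names, not part of the question, and the explanation of xmlToDataFrame's internals is an assumption based on the error message.

```r
library(XML)

# Cut-down version of the question's file: one sample holding two <test> children
xml_text <- '<experiment name="abc123">
  <sample id="2323">
    <test name="laslum"><code>"LL18179"</code></test>
    <test name="atr"><code>"ATR150607"</code></test>
  </sample>
</experiment>'

doc <- xmlParse(xml_text, asText = TRUE)

# Element names of the children of the first <sample> node
child_names <- as.character(
  xpathSApply(doc, "/experiment/sample[1]/*", xmlName)
)
print(child_names)

# TRUE when a name repeats, which cannot map onto distinct data frame columns
has_dups <- anyDuplicated(child_names) > 0
print(has_dups)
```

Because the repeated names are part of the document's structure, both approaches below build each row explicitly instead of relying on xmlToDataFrame.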
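1) Using Lines, defined in Note 2 at the end, we perform a triple-nested xpathApply/xpathSApply iteration over the experiment/sample/test nodes. e, s and t denote the current node of each kind, respectively.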
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
do.call("rbind", xpathApply(doc, "//experiment", function(e) {
data.frame(experiment = xmlAttrs(e)[["name"]],
technician = xmlValue(e[["technician"]]),
location = xmlValue(e[["location"]]),
temp = xmlValue(e[["temp"]]),
runtype = xmlValue(e[["runtype"]]),
t(do.call(cbind, xpathApply(e, "sample", function(s) {
sample <- xmlAttrs(s)[["id"]]
xpathSApply(s, "test", function(t) {
c(sample = sample,
test = xmlAttrs(t)[["name"]],
order = xmlAttrs(t)[["order"]],
code = xmlValue(t[["code"]]),
validuntil = xmlValue(t[["validuntil"]]),
baseline = xmlValue(t["meas"][[1]]),
std = xmlValue(t["meas"][[2]]),
data = xmlValue(t["meas"][[3]]),
calc = xmlValue(t[["calc"]]),
result = xmlValue(t[["result"]])
)})}))),
date = xmlAttrs(e)[["date"]],
time = xmlAttrs(e)[["time"]]
)}))
giving:
experiment technician location temp runtype sample test order
1 abc123 "John" "CO" 21.3 "routine" 2323 laslum 3
2 abc123 "John" "CO" 21.3 "routine" 2323 atr 1
3 abc123 "John" "CO" 21.3 "routine" 8979237 absat 2
code validuntil baseline std data calc result date
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131 "OK" 20150731
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731
time
1 113322
2 113322
3 113322
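The code above picks the three meas values by position (first, second, third), which works for the sample file because every test lists baseline, std and data in that order. If the order could vary, one option is to select each meas by its name attribute instead. This is a minimal sketch under that assumption; meas_by_name and the reordered toy document are hypothetical and not part of the original answer.

```r
library(XML)

# Toy <test> node whose <meas> children are deliberately out of order
xml_text <- '<sample id="1">
  <test name="demo">
    <meas name="std">5.4389</meas>
    <meas name="baseline">0.3248</meas>
    <meas name="data">6.5980</meas>
  </test>
</sample>'

doc <- xmlParse(xml_text, asText = TRUE)
test_node <- getNodeSet(doc, "//test")[[1]]

# Hypothetical helper: value of the <meas> child whose name attribute equals nm
meas_by_name <- function(node, nm) {
  xpathSApply(node, sprintf("meas[@name='%s']", nm), xmlValue)
}

baseline <- meas_by_name(test_node, "baseline")
std      <- meas_by_name(test_node, "std")
data_val <- meas_by_name(test_node, "data")
print(c(baseline = baseline, std = std, data = data_val))
```

In either approach, baseline = xmlValue(t["meas"][[1]]) could then be swapped for baseline = meas_by_name(t, "baseline"), and likewise for std and data.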
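2) Here is an alternative in which we loop only over the test nodes, then walk up to the parent and grandparent of each one to get the corresponding sample and experiment information.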
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
s <- xmlParent(t) # s is sample node
e <- xmlParent(s) # e is experiment node
data.frame(experiment = xmlAttrs(e)[["name"]],
technician = xmlValue(e[["technician"]]),
location = xmlValue(e[["location"]]),
temp = xmlValue(e[["temp"]]),
runtype = xmlValue(e[["runtype"]]),
sample = xmlAttrs(s)[["id"]],
test = xmlAttrs(t)[["name"]],
order = xmlAttrs(t)[["order"]],
code = xmlValue(t[["code"]]),
validuntil = xmlValue(t[["validuntil"]]),
baseline = xmlValue(t["meas"][[1]]),
std = xmlValue(t["meas"][[2]]),
data = xmlValue(t["meas"][[3]]),
calc = xmlValue(t[["calc"]]),
result = xmlValue(t[["result"]]),
date = xmlAttrs(e)[["date"]],
time = xmlAttrs(e)[["time"]]
)
}))
giving:
experiment technician location temp runtype sample test order
1 abc123 "John" "CO" 21.3 "routine" 2323 laslum 3
2 abc123 "John" "CO" 21.3 "routine" 2323 atr 1
3 abc123 "John" "CO" 21.3 "routine" 8979237 absat 2
code validuntil baseline std data calc result date
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131 "OK" 20150731
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731
time
1 113322
2 113322
3 113322
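Both results keep the literal double quotes that appear inside the XML text (for example "John"), and every column comes back as text, whereas the desired output in the question shows unquoted values. A possible cleanup step is sketched below on a small stand-in data frame; parsed_df and clean_column are illustrative names, and the column subset is chosen only to keep the example short.

```r
# Stand-in for a few columns of the parsed result (values copied from the question)
parsed_df <- data.frame(
  technician = '"John"',
  temp       = "21.3",
  sample     = "2323",
  validuntil = '"2016/08"',
  result     = '"OK"',
  stringsAsFactors = FALSE
)

# Strip the embedded double quotes, then let R pick a suitable type per column
clean_column <- function(col) {
  stripped <- gsub('"', "", as.character(col), fixed = TRUE)
  type.convert(stripped, as.is = TRUE)
}

clean_df <- parsed_df
clean_df[] <- lapply(parsed_df, clean_column)
str(clean_df)
```

With as.is = TRUE the text columns stay character rather than becoming factors, while columns such as temp and sample become numeric.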
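Note 1:
By the way, if you read the input XML file SEWL.xml into Excel, it will do a reasonable job of putting it into tabular form, although some further processing would be needed to get it into exactly the form shown in the question.
Note 2:
The input Lines as an R object is: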
Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>'