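I have a nested XML dataset, shown below, that I am trying to parse with the xml2 and tidyverse packages. There are three child envelopes. For each <envelope> child, I'd like to grab all of the text from its <card-id> and <value> tags and either collapse it using an easily identifiable separator (such as ;;;) or build a list of data.frames from it.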
library(xml2)
library(tidyverse)
myxml <- read_xml('
<inside>
<envelope>
<card-entry>
<card-id type="integer">605380</card-id>
<value>coke</value>
<random></random>
</card-entry>
<card-entry>
<card-id type="integer">610954</card-id>
<value>pizza</value>
<random>false</random>
</card-entry>
<card-entry>
<card-id type="integer">605381</card-id>
<value>surprise</value>
</card-entry>
<card-entry>
<card-id type="integer">610958</card-id>
<value>joke</value>
<random>true</random>
</card-entry>
</envelope>
<envelope>
<card-entry>
<card-id type="integer">605381</card-id>
<value>charlie horse</value>
</card-entry>
<card-entry>
<card-id type="integer">605380</card-id>
<value>rug bug</value>
</card-entry>
<card-entry>
<card-id type="integer">610954</card-id>
<value>mario cart</value>
</card-entry>
</envelope>
<envelope>
<card-entry>
<card-id type="integer">605377</card-id>
<value>trogdor</value>
</card-entry>
<card-entry>
<card-id type="integer"></card-id>
<value>jorb</value>
</card-entry>
<card-entry>
<card-id type="integer">605333</card-id>
<value></value>
</card-entry>
</envelope>
</inside>
'
)
The desired output is either a character vector like this:
c(
"605380;;;coke;;;610954;;;pizza;;;605381;;;surprise;;;610958;;;joke",
"605381;;;charlie horse;;;605380;;;rug bug;;;610954;;;mario cart",
"605377;;;trogdor;;;;;;jorb;;;605333;;;"
)
Or, just as good (possibly better), a nested list like this:
[[1]]
card_id value
1 605380 coke
2 610954 pizza
3 605381 surprise
4 610958 joke
[[2]]
card_id value
1 605381 charlie horse
2 605380 rug bug
3 610954 mario cart
[[3]]
card_id value
1 605377 trogdor
2 <NA> jorb
3 605333 <NA>
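I thought I could use as_list on the children and then xml_find_all to create a list of data.frames, but as_list + lapply doesn't work on just one envelope at a time; each pass covers all of them (I'd be happy to learn what I'm missing about this function).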
myxml %>%
xml_find_all('//envelope') %>%
as_list() %>%
lapply(function(x){
data_frame(
card_id = x %>% xml_find_all('//card-id') %>% xml_text(),
value = x %>% xml_find_all('//value') %>% xml_text()
)
})
Answer 0 (score: 2)
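Not exactly pretty, but you can get a list of data.frames by first splitting the children of each envelope into separate list elements, then looping over those elements to pull the text from each card-id and value node.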
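myxml %>%
  xml_find_all('//envelope') %>%
  # one nodeset of <card-entry> children per envelope
  lapply(xml_children) %>%
  # look up the named child of each card-entry and take its text
  lapply(function(x) data.frame(
    card_id = xml_child(x, 'card-id') %>% xml_text(),
    value = xml_child(x, 'value') %>% xml_text()
  ))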
#[[1]]
# card_id value
#1 605380 coke
#2 610954 pizza
#3 605381 surprise
#4 610958 joke
#
#[[2]]
# card_id value
#1 605381 charlie horse
#2 605380 rug bug
#3 610954 mario cart
#
#[[3]]
# card_id value
#1 605377 trogdor
#2 jorb
#3 605333
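As for why the attempt in the question touches every envelope on each pass: in XPath, a path that starts with `//` is evaluated from the document root, even when `xml_find_all()` is called on a single node, whereas `.//` or `./` is relative to that node. The sketch below illustrates the difference on a cut-down document and then uses relative paths to build the `;;;`-collapsed strings from the question. The sample XML and the helper name `collapse_envelope` are made up for illustration, and the code assumes every `card-entry` has both a `card-id` and a `value`; a missing child would come through `paste()` as the string `NA`.

```r
library(xml2)

# Cut-down stand-in for the question's data (illustrative only)
doc <- read_xml('<inside>
  <envelope>
    <card-entry><card-id>605377</card-id><value>trogdor</value></card-entry>
    <card-entry><card-id></card-id><value>jorb</value></card-entry>
    <card-entry><card-id>605333</card-id><value></value></card-entry>
  </envelope>
  <envelope>
    <card-entry><card-id>605381</card-id><value>charlie horse</value></card-entry>
  </envelope>
</inside>')

envelopes <- xml_find_all(doc, "//envelope")
first <- envelopes[[1]]

# "//" starts at the document root, so this sees every card-id in the file
n_abs <- length(xml_find_all(first, "//card-id"))
# ".//" starts at the context node, so this sees only the first envelope's
n_rel <- length(xml_find_all(first, ".//card-id"))

# Hypothetical helper: collapse one envelope into "id;;;value;;;id;;;value..."
collapse_envelope <- function(env, sep = ";;;") {
  entries <- xml_find_all(env, "./card-entry")
  ids <- xml_text(xml_find_first(entries, "./card-id"))
  values <- xml_text(xml_find_first(entries, "./value"))
  # rbind() + as.vector() interleaves: id1, value1, id2, value2, ...
  paste(as.vector(rbind(ids, values)), collapse = sep)
}

collapsed <- vapply(envelopes, collapse_envelope, character(1))
print(c(absolute = n_abs, relative = n_rel))
print(collapsed)
```

Using `xml_find_first()` on the `card-entry` nodeset, rather than `xml_find_all()` on the envelope, keeps ids and values aligned row by row, because it returns exactly one result per entry.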
For NAs instead of "", you can add %>% ifelse(. == "", NA, .) after each xml_text.
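That replacement can also be wrapped in a small helper so it is only written once. A minimal sketch, assuming the same envelope/card-entry layout as in the question; `text_or_na` is a hypothetical helper name, not part of xml2, and the sample XML is a cut-down stand-in:

```r
library(xml2)

# Cut-down stand-in for the third envelope in the question (illustrative only)
doc <- read_xml('<inside>
  <envelope>
    <card-entry><card-id>605377</card-id><value>trogdor</value></card-entry>
    <card-entry><card-id></card-id><value>jorb</value></card-entry>
    <card-entry><card-id>605333</card-id><value></value></card-entry>
  </envelope>
</inside>')

# Hypothetical helper: node text, with empty strings turned into NA
text_or_na <- function(nodes) {
  txt <- xml_text(nodes)
  txt[!nzchar(txt)] <- NA_character_
  txt
}

frames <- lapply(xml_find_all(doc, "//envelope"), function(env) {
  entries <- xml_children(env)  # the <card-entry> nodes of this envelope
  data.frame(
    card_id = text_or_na(xml_find_first(entries, "./card-id")),
    value = text_or_na(xml_find_first(entries, "./value")),
    stringsAsFactors = FALSE
  )
})

print(frames[[1]])
```

Setting `stringsAsFactors = FALSE` keeps both columns as character vectors on older R versions, where `data.frame()` converted strings to factors by default.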