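I have a situation very similar to this one (Load XML to Dataframe in R with parent node attributes): I am trying to convert an XML file to a data frame, but I can't handle the cases where the "sp" and "l" nodes are absent. (I don't care about the "m" nodes.) Suppose my XML looks like this: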
<text>
<body>
<div1 type="scene1” n="1">
<sp who="fau">
<l c="30" a="Settle thy studies"/>
<m x="40" b="To sound the depth of that thou wilt profess"/>
</sp>
<sp who="eang">
<m x="105" b="Go forward, Faustus, in that famous art"/>
</sp>
</div1>
<div1 type="scene2” n="2">
<sp who="fau">
<l c="31" a="Settle thy"/>
<m x="50" b="To sound the depth of"/>
</sp>
<sp who="fau">
<l c="32" a="Settle"/>
<m x="60" b="To sound the"/>
</sp>
<sp who="fau">
<l c="33" a="Settle thy studies, Faustus"/>
<m x="40" b="To sound the depth of that thou wilt"/>
</sp>
</div1>
<div1 type="scene3” n="3">
</div1>
<div1 type="scene4” n="4">
</div1>
<div1 type="scene5” n="5">
</div1>
</body>
</text>
This is what I would like to obtain:
n  type    lc  la
1  scene1  30  Settle thy studies
2  scene2  31  Settle thy
2  scene2  32  Settle
2  scene2  33  Settle thy studies, Faustus
3  scene3  NA  NA
4  scene4  NA  NA
5  scene5  NA  NA
I tried this:
doc = xmlTreeParse("play.xml", useInternal = TRUE)
bodyToDF <- function(x){
  n <- xmlGetAttr(x, "n")
  type <- xmlGetAttr(x, "type")
  sp <- xpathApply(x, 'sp', function(sp) {
    if(is.null(sp)) {
      lc <- NA
      la <- NA
    }
    lc <- xpathSApply(sp, 'l', function(l) { xmlGetAttr(l,"c")})
    la = xpathSApply(sp, 'l', function(l) { xmlValue(l,"a")})
    data.frame(n, type, lc, la)
  })
  do.call(rbind, sp)
}
res <- xpathApply(doc, '//div1', bodyToDF)
But it doesn't work:
Error in data.frame(n, type, lc, la) :
arguments imply differing number of rows: 1, 0
And also this:
div1 = sapply(c("n","type"), function(x) xpathSApply(doc, "//div1", xmlGetAttr, x), simplify=FALSE)
l = sapply(c("c","a"), function(x) xpathSApply(doc, "//l", xmlGetAttr, x), simplify=FALSE)
df <- data.frame(div1,l)
But I can't seem to get the nodes and the data frame rows to match up correctly:
Error in data.frame(div1, l) :
arguments imply differing number of rows: 5, 4
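Any ideas? Thanks.
Answer 0 (score: 0)
There are problems with the XML text you pasted (some of the double quotes are not plain double quotes), so here is a clean version for others to use: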
txt <- '<text>
<body>
<div1 type="scene1" n="1">
<sp who="fau">
<l c="30" a="Settle thy studies"/>
<m x="40" b="To sound the depth of that thou wilt profess"/>
</sp>
<sp who="eang">
<m x="105" b="Go forward, Faustus, in that famous art"/>
</sp>
</div1>
<div1 type="scene2" n="2">
<sp who="fau">
<l c="31" a="Settle thy"/>
<m x="50" b="To sound the depth of"/>
</sp>
<sp who="fau">
<l c="32" a="Settle"/>
<m x="60" b="To sound the"/>
</sp>
<sp who="fau">
<l c="33" a="Settle thy studies, Faustus"/>
<m x="40" b="To sound the depth of that thou wilt"/>
</sp>
</div1>
<div1 type="scene3" n="3"></div1>
<div1 type="scene4" n="4"></div1>
<div1 type="scene5" n="5"></div1>
</body>
</text>'
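The following can be translated back to XML package syntax if that is really necessary, but the idea is similar to the other answer: you need to examine each "scene" node and handle the missing-value case when it occurs: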
library(xml2)
library(purrr)
library(dplyr)

doc <- read_xml(txt)

# Visit every element whose "type" attribute contains "scene"
xml_find_all(doc, ".//*[contains(@type, 'scene')]") %>%
  map_df(function(x) {
    scene <- xml_attr(x, "type")
    num <- xml_attr(x, "n")
    lines <- xml_find_all(x, ".//l")
    if (length(lines) == 0) {
      # No <l> node in this scene: emit one row of typed NAs
      tibble(n = num, scene = scene, lc = NA_character_, la = NA_character_)
    } else {
      # One row per <l> node; xml_attr() already returns NA for a missing attribute
      map_df(lines, function(y) {
        lc <- xml_attr(y, "c")
        la <- xml_attr(y, "a")
        tibble(n = num, scene = scene, lc = lc, la = la)
      })
    }
  })
And it gives you the desired output:
## # A tibble: 7 × 4
##       n  scene    lc                          la
##   <chr>  <chr> <chr>                       <chr>
## 1     1 scene1    30          Settle thy studies
## 2     2 scene2    31                  Settle thy
## 3     2 scene2    32                      Settle
## 4     2 scene2    33 Settle thy studies, Faustus
## 5     3 scene3  <NA>                        <NA>
## 6     4 scene4  <NA>                        <NA>
## 7     5 scene5  <NA>                        <NA>
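For anyone who does want to stay with the XML package, here is a rough sketch of the same idea: visit each div1 node, collect its l descendants, and fall back to a single NA row when there are none. The helper name scene_to_df and the shortened sample string xml_txt are made up for this illustration and are not part of the question or the answer above; treat it as a starting point under those assumptions, not a drop-in replacement.

```r
library(XML)

# Shortened sample (an assumption for this sketch): one scene with an <l> node,
# one <sp> without any <l>, and one empty scene.
xml_txt <- '<text><body>
<div1 type="scene1" n="1">
<sp who="fau"><l c="30" a="Settle thy studies"/><m x="40" b="To sound"/></sp>
<sp who="eang"><m x="105" b="Go forward"/></sp>
</div1>
<div1 type="scene3" n="3"></div1>
</body></text>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Hypothetical helper: turn one <div1> node into a data frame
scene_to_df <- function(div) {
  n <- xmlGetAttr(div, "n")
  type <- xmlGetAttr(div, "type")
  lines <- getNodeSet(div, ".//l")  # every <l> below this scene, at any depth
  if (length(lines) == 0) {
    # No <l> nodes: keep the scene as one row of NAs
    return(data.frame(n = n, type = type,
                      lc = NA_character_, la = NA_character_,
                      stringsAsFactors = FALSE))
  }
  # One row per <l>; the third argument is the default for a missing attribute
  data.frame(n = n, type = type,
             lc = sapply(lines, xmlGetAttr, "c", NA_character_),
             la = sapply(lines, xmlGetAttr, "a", NA_character_),
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, xpathApply(doc, "//div1", scene_to_df))
print(res)
```

Searching for l from the div1 level, instead of looping over sp first, avoids the zero-row data frame that triggers the "differing number of rows" error in the question: an sp that holds only m nodes simply contributes nothing.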