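Handling non-existent nodes in R XML when converting to a data frame

Time: 2016-09-19 08:52:25

Tags: r xml

I have a situation very similar to this one (Load XML to Dataframe in R with parent node attributes): I am trying to convert XML to a data frame, but I cannot handle missing "sp" and "l" nodes. (I don't care about the "m" nodes.) Suppose my XML looks like this: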

<text>
<body>
<div1 type="scene1” n="1">
<sp who="fau">
    <l c="30" a="Settle thy studies"/>
    <m x="40" b="To sound the depth of that thou wilt profess"/>
</sp>
<sp who="eang">
        <m x="105" b="Go forward, Faustus, in that famous art"/>
</sp>
</div1>
<div1 type="scene2” n="2">
<sp who="fau">
    <l c="31" a="Settle thy"/>
    <m x="50" b="To sound the depth of"/>
</sp>
<sp who="fau">
    <l c="32" a="Settle"/>
    <m x="60" b="To sound the"/>
</sp>
<sp who="fau">
    <l c="33" a="Settle thy studies, Faustus"/>
    <m x="40" b="To sound the depth of that thou wilt"/>
</sp>
</div1>
<div1 type="scene3” n="3">
</div1>
<div1 type="scene4” n="4">
</div1>
<div1 type="scene5” n="5">
</div1>
</body>
</text>

This is what I want to obtain:

n   type      lc     la
1   scene1    30     Settle thy studies
2   scene2    31     Settle thy
2   scene2    32     Settle
2   scene2    33     Settle thy studies, Faustus
3   scene3    NA     NA      
4   scene4    NA     NA
5   scene5    NA     NA

I tried this:

doc = xmlTreeParse("play.xml", useInternal = TRUE)

bodyToDF <- function(x){
n <- xmlGetAttr(x, "n")
type <- xmlGetAttr(x, "type")
sp <- xpathApply(x, 'sp', function(sp) {
if(is.null(sp)) {
    lc <- NA
    la <- NA
}
lc <- xpathSApply(sp, 'l', function(l) { xmlGetAttr(l,"c")})
la = xpathSApply(sp, 'l', function(l) { xmlValue(l,"a")})
data.frame(n, type, lc, la)
})
do.call(rbind, sp)  
}


res <- xpathApply(doc, '//div1', bodyToDF)

But it doesn't work:

Error in data.frame(n, type, lc, la) : 
arguments imply differing number of rows: 1, 0

And also this:

div1 = sapply(c("n","type"), function(x) xpathSApply(doc, "//div1", xmlGetAttr, x), simplify=FALSE)

l = sapply(c("c","a"), function(x) xpathSApply(doc, "//l", xmlGetAttr, x), simplify=FALSE)

df <- data.frame(div1,l)

But I can't seem to get the nodes and the data frame rows to match up correctly:

Error in data.frame(div1, l) : 
arguments imply differing number of rows: 5, 4

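Any ideas? Thanks.

1 Answer:

Answer 0: (score: 0)

There is a problem with the XML text you pasted (some of the double quotes are not plain double quotes), so here is a clean version for others: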

txt <- '<text>
    <body>
        <div1 type="scene1" n="1">
            <sp who="fau">
                <l c="30" a="Settle thy studies"/>
                <m x="40" b="To sound the depth of that thou wilt profess"/>
            </sp>
            <sp who="eang">
                <m x="105" b="Go forward, Faustus, in that famous art"/>
            </sp>
        </div1>
        <div1 type="scene2" n="2">
            <sp who="fau">
                <l c="31" a="Settle thy"/>
                <m x="50" b="To sound the depth of"/>
            </sp>
            <sp who="fau">
                <l c="32" a="Settle"/>
                <m x="60" b="To sound the"/>
            </sp>
            <sp who="fau">
                <l c="33" a="Settle thy studies, Faustus"/>
                <m x="40" b="To sound the depth of that thou wilt"/>
            </sp>
        </div1>
        <div1 type="scene3" n="3"></div1>
        <div1 type="scene4" n="4"></div1>
        <div1 type="scene5" n="5"></div1>
    </body>
</text>'
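If you would rather repair the pasted text programmatically than retype it, a minimal base-R sketch follows. It assumes the only damage is typographic (curly) double quotes where plain ones belong; the `raw` string is a made-up one-line sample, not the full document.

```r
# Hypothetical one-line sample with the same typographic-quote problem
raw <- '<div1 type="scene1\u201D n="1"></div1>'

# Replace left (U+201C) and right (U+201D) curly double quotes with a plain "
fixed <- gsub("\u201C|\u201D", "\"", raw)

print(fixed)
```

After this substitution the string should parse like the hand-corrected version above.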

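The following can be translated back to XML package syntax if truly necessary, but the idea is similar to the other answers: you need to go through each "scene" node and handle the missing-value case (if it occurs):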

library(xml2)
library(purrr)
library(dplyr)

doc <- read_xml(txt)

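# Visit every scene node (any element whose "type" attribute contains "scene")
xml_find_all(doc, ".//*[contains(@type, 'scene')]") %>%
  map_df(function(x) {

    scene <- xml_attr(x, "type")
    num <- xml_attr(x, "n")

    # All <l> descendants of this scene; <m> nodes are ignored
    lines <- xml_find_all(x, ".//l")

    if (length(lines) == 0) {
      # Scene without any <l>: emit a single placeholder row of NAs
      tibble(n=num, scene=scene, lc=NA_character_, la=NA_character_)
    } else {
      # One row per <l>; xml_attr() already returns NA for a missing attribute
      map_df(lines, function(y) {
        lc <- xml_attr(y, "c")
        la <- xml_attr(y, "a")
        tibble(n=num, scene=scene, lc=lc, la=la)
      })
    }

  })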

And it gives you the desired output:

## # A tibble: 7 × 4
##       n  scene    lc                          la
##   <chr>  <chr> <chr>                       <chr>
## 1     1 scene1    30          Settle thy studies
## 2     2 scene2    31                  Settle thy
## 3     2 scene2    32                      Settle
## 4     2 scene2    33 Settle thy studies, Faustus
## 5     3 scene3  <NA>                        <NA>
## 6     4 scene4  <NA>                        <NA>
## 7     5 scene5  <NA>                        <NA>
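For readers who want to stay with the XML package, here is a rough sketch of the same per-scene check. It is not the asker's original function: the helper name `scene_to_df` and the shortened sample document `xml_txt` are made up for illustration. The original attempt appears to fail because an "sp" without any "l" yields a zero-length result that `data.frame()` cannot combine with the length-one `n` and `type`, so this version tests for that case explicitly.

```r
library(XML)

# Shortened sample (assumption: same structure as the document above)
xml_txt <- '<text><body>
  <div1 type="scene1" n="1">
    <sp who="fau"><l c="30" a="Settle thy studies"/><m x="40" b="To sound"/></sp>
    <sp who="eang"><m x="105" b="Go forward"/></sp>
  </div1>
  <div1 type="scene2" n="2">
    <sp who="fau"><l c="31" a="Settle thy"/></sp>
    <sp who="fau"><l c="32" a="Settle"/></sp>
  </div1>
  <div1 type="scene3" n="3"></div1>
</body></text>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Hypothetical helper: one data frame per <div1>, with an NA row if it has no <l>
scene_to_df <- function(div) {
  n <- xmlGetAttr(div, "n")
  type <- xmlGetAttr(div, "type")
  lines <- getNodeSet(div, ".//l")   # all <l> descendants, may be empty

  if (length(lines) == 0) {
    return(data.frame(n = n, type = type,
                      lc = NA_character_, la = NA_character_,
                      stringsAsFactors = FALSE))
  }

  data.frame(n = n, type = type,
             lc = sapply(lines, xmlGetAttr, "c", default = NA_character_),
             la = sapply(lines, xmlGetAttr, "a", default = NA_character_),
             stringsAsFactors = FALSE)
}

# Apply the helper to every scene and stack the results
res <- do.call(rbind, lapply(getNodeSet(doc, "//div1"), scene_to_df))
print(res)
```

Searching from each "div1" rather than from each "sp" means a speaker without an "l" contributes nothing, while a scene without any "l" still gets its NA row.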