Loading XML into a data.frame in R using parent-node attributes

Time: 2015-03-03 07:35:26

Tags: xml r dataframe tei

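I have an XML file (a TEI-encoded play) that I want to process into a data.frame in R, where each row of the data.frame contains one line of the play, the line number, the speaker of that line, the scene number, and the scene type. The body of the XML file looks like this (but longer):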

<text>
<body>
<div1 type="scene" n="1">
    <sp who="fau">
        <l n="30">Settle thy studies, Faustus, and begin</l>
        <l n="31">To sound the depth of that thou wilt profess;</l>
        <l n="32">Having commenced, be a divine in show,</l>
    </sp>
    <sp who="eang">
        <l n="105">Go forward, Faustus, in that famous art,</l>
    </sp>
</div1>
<div1 type="scene" n="2">
    <sp who="sch1">
        <l n="NA">I wonder what's become of Faustus, that was wont to make our schools ring with sic probo.</l>
    </sp>
    <sp who="sch2">
        <l n="NA">That shall we know, for see here comes his boy.</l>
    </sp>
    <sp who="sch1">
        <l n="NA">How now sirrah, where's thy master?</l>
    </sp>
    <sp who="wag">
        <l n="NA">God in heaven knows.</l>
    </sp>   
</div1>
</body>
</text>

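The problem seems similar to two earlier questions, but my XML file is structured slightly differently, so neither gave me a working solution. This is as far as I have got: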

library(XML)
doc <- xmlTreeParse("data/faustus_sample.xml", useInternalNodes=TRUE)

bodyToDF <- function(x){
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  attributes <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlAttrs)
  linecontent <- sapply(xmlChildren(x), xmlValue)
  data.frame(scenenum = scenenum, scenetype = scenetype, attributes = attributes, linecontent = linecontent, stringsAsFactors = FALSE)
}

res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)

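This returns a data.frame in which the scene number, scene type, and speaker are intact, but I can't figure out how to break it down into individual lines (and get the associated line numbers).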
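For illustration, a minimal sketch of why the rows stay at the speech level, using a made-up one-speech fragment and the hypothetical names `doc_demo` and `sp_node`: `xmlValue` on an `sp` node returns the text of all its `l` children joined together, whereas querying the `l` nodes relative to that `sp` gives one value per line.

```r
library(XML)

# Made-up fragment with the same nesting as the sample above
xml_text <- '<div1 type="scene" n="1"><sp who="fau"><l n="30">Settle thy studies, Faustus, and begin</l><l n="31">To sound the depth of that thou wilt profess;</l></sp></div1>'
doc_demo <- xmlParse(xml_text, asText = TRUE)

# First (and only) <sp> node in the fragment
sp_node <- getNodeSet(doc_demo, "//sp")[[1]]

# Text of the whole speech as a single string (both lines run together)
speech_text <- xmlValue(sp_node)

# One string per <l> child, and the matching line numbers
line_text <- xpathSApply(sp_node, "l", xmlValue)
line_nums <- xpathSApply(sp_node, "l", xmlGetAttr, "n")
```

So to get one row per line, the `l` elements have to be visited individually rather than through their parent `sp`.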

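I tried importing the file as a list (via xmlToList), but that gave me a very messy list of lists of lists, and trying to reach the different list elements with for loops produced lots of different errors (terrible, I know!).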

Ideally, I am looking for a solution that handles the whole file, mess and all, and that also works for other similarly structured XML files.

I have only just started using R and am completely lost. Any help you can offer would be greatly appreciated.

Thanks for your help!

Edit: a copy of the full XML file is available here

1 Answer:

Answer 0: (score: 1)

Add an extra xpathApply over the sp elements:

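# Assumes library(XML) is loaded and `doc` was parsed as in the question.
bodyToDF <- function(x){
  # Scene-level attributes from the <div1> node
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  # One data.frame per <sp> (speech), with one row per <l> (line)
  sp <- xpathApply(x, 'sp', function(sp) {
    who <- xmlGetAttr(sp, "who", NA)  # NA when the speaker attribute is missing
    line_num <- xpathSApply(sp, 'l', function(l) xmlGetAttr(l, "n", NA))
    linecontent <- xpathSApply(sp, 'l', function(l) xmlValue(l))
    data.frame(scenenum, scenetype, who, line_num, linecontent, stringsAsFactors = FALSE)
  })
  # Stack the per-speech data.frames into one per scene
  do.call(rbind, sp)
}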

res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)

The first 4 columns:

# > temp.df[,1:4]
#   scenenum scenetype  who line_num
# 1        1     scene  fau       30
# 2        1     scene  fau       31
# 3        1     scene  fau       32
# 4        1     scene eang      105
# 5        2     scene sch1       NA
# 6        2     scene sch2       NA
# 7        2     scene sch1       NA
# 8        2     scene  wag       NA
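As a variation on the approach above (not part of the original answer), here is a minimal sketch that starts from each `l` node and walks up to its parents with `xmlParent`, so every row is built from a single line. The inline sample, `doc2`, `lineToRow`, and `lines.df` are hypothetical names for illustration, and the sketch assumes every `l` is a direct child of an `sp` that is itself a direct child of a `div1`, as in the sample file.

```r
library(XML)

# Small inline sample with the same structure as the question's file
xml_text <- '<text><body>
<div1 type="scene" n="1">
  <sp who="fau">
    <l n="30">Settle thy studies, Faustus, and begin</l>
    <l n="31">To sound the depth of that thou wilt profess;</l>
  </sp>
  <sp who="eang">
    <l n="105">Go forward, Faustus, in that famous art,</l>
  </sp>
</div1>
<div1 type="scene" n="2">
  <sp who="wag">
    <l n="NA">God in heaven knows.</l>
  </sp>
</div1>
</body></text>'

doc2 <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: build one row from a single <l> node
lineToRow <- function(l) {
  sp  <- xmlParent(l)    # assumes <l> is a direct child of <sp>
  div <- xmlParent(sp)   # assumes <sp> is a direct child of <div1>
  data.frame(
    scenenum    = xmlGetAttr(div, "n", NA),
    scenetype   = xmlGetAttr(div, "type", NA),
    who         = xmlGetAttr(sp, "who", NA),
    line_num    = xmlGetAttr(l, "n", NA),
    linecontent = xmlValue(l),
    stringsAsFactors = FALSE
  )
}

# Visit every line directly and stack the one-row data.frames
lines.df <- do.call(rbind, xpathApply(doc2, "//div1/sp/l", lineToRow))
```

Building one data.frame per line is simple but can be slow on a full play; the nested version above creates fewer data.frames. Files where `l` elements sit at a different depth (for example inside extra wrapper elements) would need the XPath and the `xmlParent` steps adjusted.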