XML processing in R: using xmlGetAttr on child nodes

Asked: 2016-09-24 01:01:32

Tags: r xml

I have several XML files whose structure resembles the following:

<?xml version='1.0' encoding='UTF-8'?> 

<text>

  <stage></stage>

    <div>
      <intro agent= "Peter"></intro>
        <dialogue agent= "Peter"></dialogue>
      <outro agent= "Stephen"></outro>
    </div>

    <div>
     <intro agent= "Sandra"></intro>
        <dialogue agent= "Peter"></dialogue>
     <outro agent= "Robert"></outro>
    </div>

  <stage></stage>

</text>

My goal is to get a list of all "agent" attribute values. I came up with

agents <- xmlApply(xml_processed[["test.xml"]], xmlGetAttr, "agent", default= "-")

but this only gives me the values when the attribute sits on the "div" nodes themselves, not on their children. xml_processed is built with:

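# Preprocess XML: parse every file in data/XML and keep its root node
library(XML)

preprocess_xml <- function() {
  path <- "data/XML"
  # pattern is a regular expression, not a glob, so match the extension explicitly
  xmlfiles <- list.files(path, pattern = "\\.xml$")
  xmlfiles_path <- file.path(path, xmlfiles)

  xmlfinal <- list()

  # seq_along() is safe even when no files are found (1:length() is not)
  for (i in seq_along(xmlfiles)) {
    # parse the file and store its root node under the file name
    xmlfinal[[xmlfiles[i]]] <- xmlRoot(xmlTreeParse(xmlfiles_path[i]))
  }
  return(xmlfinal)
}

xml_processed <- preprocess_xml()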

I also tried

agents <- xmlApply(xml_processed[["test.xml"]], "/text/div/intro", xmlGetAttr, "agent", default= "-")

to get the agents of the intro nodes. But this just gives me an error:

get(as.character(FUN), mode = "function", envir = envir)

1 Answer:

Answer 0 (score: 3)

I think it's time to focus on XPath rather than on R:

txt <- '<?xml version="1.0" encoding="UTF-8"?> 
<text>
  <stage></stage>
    <div>
      <intro agent= "Peter"></intro>
        <dialogue agent= "Peter"></dialogue>
      <outro agent= "Stephen"></outro>
    </div>
    <div>
     <intro agent= "Sandra"></intro>
        <dialogue agent= "Peter"></dialogue>
     <outro agent= "Robert"></outro>
    </div>
  <stage></stage>
</text>'

library(xml2)
library(magrittr)

doc <- read_xml(txt)
xml_find_all(doc, ".//*[@agent]") %>% 
  xml_attr("agent")
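
The same approach also covers the narrower case from the question, where only the agents of the intro nodes are wanted. The following is a minimal sketch, assuming the xml2 package and a shortened copy of the sample document; the names sample_xml, intro_agents, and all_agents are illustrative and not part of the original code.

```r
library(xml2)

# Shortened copy of the sample document from the question
sample_xml <- '<text>
  <stage></stage>
  <div>
    <intro agent="Peter"></intro>
    <dialogue agent="Peter"></dialogue>
    <outro agent="Stephen"></outro>
  </div>
  <div>
    <intro agent="Sandra"></intro>
    <dialogue agent="Peter"></dialogue>
    <outro agent="Robert"></outro>
  </div>
  <stage></stage>
</text>'

doc <- read_xml(sample_xml)

# Absolute path: only <intro> elements that are children of /text/div
intro_agents <- xml_attr(xml_find_all(doc, "/text/div/intro"), "agent")

# Every element carrying an agent attribute, in document order, duplicates removed
all_agents <- unique(xml_attr(xml_find_all(doc, "//*[@agent]"), "agent"))
```

Here the XPath expression is passed to xml_find_all, which selects the nodes first; the attribute is read in a second step. That is also why the xmlApply attempt in the question fails: xmlApply expects a function as its second argument, not a path.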

If you have to use the XML package:

library(XML)

doc <- xmlParse(txt)
xpathSApply(doc, "//*[@agent]", xmlGetAttr, "agent")
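
To run this over several files, as in the question, one option is to parse each file with xmlParse rather than xmlTreeParse. As far as I know, xmlTreeParse returns R-level nodes by default, and the XPath functions need the internal document that xmlParse produces. The sketch below rests on assumptions: the helper get_agents and the two temporary sample files are hypothetical stand-ins for the real contents of data/XML.

```r
library(XML)

# Write two small sample files to a temporary directory (stand-in for data/XML)
xml_dir <- tempfile("xml_")
dir.create(xml_dir)
writeLines('<text><div><intro agent="Peter"/><outro agent="Stephen"/></div></text>',
           file.path(xml_dir, "a.xml"))
writeLines('<text><stage/><div><intro agent="Sandra"/></div></text>',
           file.path(xml_dir, "b.xml"))

# Hypothetical helper: return the agent attribute values found in one file
get_agents <- function(file) {
  doc <- xmlParse(file)  # internal document, so XPath queries work
  # unlist() + as.character() yields character(0) when nothing matches
  as.character(unlist(xpathSApply(doc, "//*[@agent]", xmlGetAttr, "agent")))
}

xml_files <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)

# One character vector of agents per file, named by file name
agents_by_file <- lapply(xml_files, get_agents)
names(agents_by_file) <- basename(xml_files)
```

The result is a named list, so agents_by_file[["a.xml"]] plays the same role as xml_processed[["test.xml"]] did in the question, except that it already holds the attribute values.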