I have several XML files whose structure resembles the following:
<?xml version='1.0' encoding='UTF-8'?>
<text>
<stage></stage>
<div>
<intro agent= "Peter"></intro>
<dialogue agent= "Peter"></dialogue>
<outro agent= "Stephen"></outro>
</div>
<div>
<intro agent= "Sandra"></intro>
<dialogue agent= "Peter"></dialogue>
<outro agent= "Robert"></outro>
</div>
<stage></stage>
</text>
My goal is to get a list of all "agents". I came up with
agents <- xmlApply(xml_processed[["test.xml"]], xmlGetAttr, "agent", default= "-")
But this only gives me the corresponding values if they are on the "div" node itself. xml_processed is the result of:
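# Preprocess XML: parse every file in data/XML and return the root nodes,
# named by file name
library(XML)

preprocess_xml <- function() {
  path <- "data/XML"
  # pattern is a regular expression, not a glob, so match the extension explicitly
  xmlfiles <- list.files(path, pattern = "\\.xml$")
  xmlfiles_path <- file.path(path, xmlfiles)
  # Parse each file into an R-level tree
  xmlcontent <- list()
  for (i in seq_along(xmlfiles)) {
    xmlcontent[[xmlfiles[i]]] <- xmlTreeParse(xmlfiles_path[i])
  }
  # Keep only the root node of each parsed document
  xmlfinal <- list()
  for (i in seq_along(xmlcontent)) {
    xmlfinal[[xmlfiles[i]]] <- xmlRoot(xmlcontent[[i]])
  }
  return(xmlfinal)
}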
I also tried
agents <- xmlApply(xml_processed[["test.xml"]], "/text/div/intro", xmlGetAttr, "agent", default= "-")
to get the agents of the intro nodes. But this only gives me an error:
get(as.character(FUN), mode = "function", envir = envir)
Answer 0 (score: 3)
I think it's time to focus on XPath rather than on R:
txt <- '<?xml version="1.0" encoding="UTF-8"?>
<text>
<stage></stage>
<div>
<intro agent= "Peter"></intro>
<dialogue agent= "Peter"></dialogue>
<outro agent= "Stephen"></outro>
</div>
<div>
<intro agent= "Sandra"></intro>
<dialogue agent= "Peter"></dialogue>
<outro agent= "Robert"></outro>
</div>
<stage></stage>
</text>'
library(xml2)
library(magrittr)
doc <- read_xml(txt)
xml_find_all(doc, ".//*[@agent]") %>%
xml_attr("agent")
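The same pattern can be narrowed to a single element type, such as the intro nodes the question asks about. What follows is only a minimal sketch, assuming the xml2 package and a shortened inline copy of the sample document. The names all_agents, intro_agents and unique_agents are illustrative and do not come from the original post.

```r
library(xml2)

# Inline copy of the sample document from the question (stage elements omitted)
txt <- '<?xml version="1.0" encoding="UTF-8"?>
<text>
<div>
<intro agent="Peter"></intro>
<dialogue agent="Peter"></dialogue>
<outro agent="Stephen"></outro>
</div>
<div>
<intro agent="Sandra"></intro>
<dialogue agent="Peter"></dialogue>
<outro agent="Robert"></outro>
</div>
</text>'

doc <- read_xml(txt)

# Every element that carries an agent attribute, in document order
all_agents <- xml_attr(xml_find_all(doc, ".//*[@agent]"), "agent")

# Only the intro elements, addressed by their absolute path
intro_agents <- xml_attr(xml_find_all(doc, "/text/div/intro"), "agent")

# Distinct agent names, in order of first appearance
unique_agents <- unique(all_agents)
```

Because the XPath expression does the selection, switching from "all agents" to "intro agents only" changes one string and leaves the R code alone.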
If you must use the XML package:
library(XML)
doc <- xmlParse(txt)
xpathSApply(doc, "//*[@agent]", xmlGetAttr, "agent")
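To apply this to several files, as in the question, one possible approach is sketched below. XPath queries in the XML package need an internal document, which xmlParse returns. xmlTreeParse with its default settings builds R-level nodes instead, and xpathSApply cannot query those. The helper get_agents, the temporary directory and the one-line sample file are assumptions made only so the example runs on its own.

```r
library(XML)

# Hypothetical helper: return every agent attribute found in one file
get_agents <- function(file) {
  doc <- xmlParse(file)  # internal document, required for XPath queries
  xpathSApply(doc, "//*[@agent]", xmlGetAttr, "agent")
}

# Stand-in for the data/XML directory: a temporary folder with one sample file
tmp_dir <- tempfile("xml_demo")
dir.create(tmp_dir)
writeLines(
  '<text><div><intro agent="Peter"/><outro agent="Stephen"/></div></text>',
  file.path(tmp_dir, "test.xml")
)

# pattern is a regular expression, so the extension is matched explicitly
xmlfiles <- list.files(tmp_dir, pattern = "\\.xml$", full.names = TRUE)

# One character vector of agents per file, named by file name
agents_by_file <- lapply(xmlfiles, get_agents)
names(agents_by_file) <- basename(xmlfiles)
```

agents_by_file[["test.xml"]] then holds the agents for that file, mirroring how xml_processed is indexed in the question.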