Parse XML with missing elements and convert it to a table structure

Time: 2014-08-25 16:08:34

Tags: xml r dataframe data.table reshape

I am trying to parse an XML file. A simplified version of it looks like this:

x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

library(XML)
xmlRoot(xmlTreeParse(x))
## <grandparent>
##   <parent>
##     <child1>ABC123</child1>
##     <child2>1381956044</child2>
##   </parent>
##   <parent>
##     <child2>1397527137</child2>
##   </parent>
##   <parent>
##     <child3>4675</child3>
##   </parent>
##   <parent>
##     <child1>DEF456</child1>
##     <child3>3735</child3>
##   </parent>
##   <parent>
##     <child1/>
##     <child3>3735</child3>
##   </parent>
## </grandparent>

I would like to convert the XML into a data.frame / data.table that looks like this:

parent <- data.frame(child1=c("ABC123",NA,NA,"DEF456",NA), child2=c(1381956044, 1397527137, rep(NA, 3)), child3=c(rep(NA, 2), 4675, 3735, 3735))
parent
##   child1     child2 child3
## 1 ABC123 1381956044     NA
## 2   <NA> 1397527137     NA
## 3   <NA>         NA   4675
## 4 DEF456         NA   3735
## 5   <NA>         NA   3735

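If every parent node always contained all of the possible elements ("child1", "child2", "child3", etc.), I could use xmlToList and unlist to flatten it, and then dcast to put it into a table. But the XML often has missing child elements. Here is an attempt that produces incorrect output: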

library(data.table)

## Flatten:
dt <- as.data.table(unlist(xmlToList(x)), keep.rownames=T)
setnames(dt, c("column", "value"))

## Add row numbers, but they're incorrect due to missing XML elements:
dt[, row:=.SD[,.I], by=column][]
##           column      value row
## 1: parent.child1     ABC123   1
## 2: parent.child2 1381956044   1
## 3: parent.child2 1397527137   2
## 4: parent.child3       4675   1
## 5: parent.child1     DEF456   2
## 6: parent.child3       3735   2
## 7: parent.child3       3735   3

## Reshape from long to wide, but some values are in the wrong row:
dcast.data.table(dt, row~column, value.var="value", fill=NA)
##    row parent.child1 parent.child2 parent.child3
## 1:   1        ABC123    1381956044          4675
## 2:   2        DEF456    1397527137          3735
## 3:   3            NA            NA          3735

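I won't know the names of the child elements ahead of time, or the number of unique element names among the grandparent's children, so the answer should be flexible.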

Updated example

The actual XML file has several levels of nesting, and xmlToDataFrame throws an error on it. Here is an updated (but still simplified) version:

x2 <- '<grandparent><grandparentInfo junk="TRUE"><grandparent1>foo</grandparent1><grandparent1>bar</grandparent1></grandparentInfo><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

xmlToDataFrame(x2)
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("foo",  : 
##   duplicate subscripts for columns

2 Answers:

Answer 0: (score: 1)

As suggested in the comments, you can simply use xmlToDataFrame on x:

> library(XML)
> y <- xmlToDataFrame(x)
> y[y == ""] <- NA
> y
#   child1     child2 child3
# 1 ABC123 1381956044   <NA>
# 2   <NA> 1397527137   <NA>
# 3   <NA>       <NA>   4675
# 4 DEF456       <NA>   3735
# 5   <NA>       <NA>   3735

For a data.table result,

> library(data.table)
> data.table(y)
#    child1     child2 child3
# 1: ABC123 1381956044     NA
# 2:     NA 1397527137     NA
# 3:     NA         NA   4675
# 4: DEF456         NA   3735
# 5:     NA         NA   3735

You may want to use the colClasses argument to put the columns into the correct classes for analysis.
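
If setting colClasses up front is inconvenient (for example, because the column names are not known in advance), one possible alternative is to convert the columns after the fact. The sketch below is only an illustration: it rebuilds y by hand as an all-character data frame, which is an assumption about what the result above looks like, and then lets base R's type.convert guess each column's class.

```r
# Hand-built stand-in for the xmlToDataFrame result above (assumption:
# every column comes back as character, with NA for missing values).
y <- data.frame(
  child1 = c("ABC123", NA, NA, "DEF456", NA),
  child2 = c("1381956044", "1397527137", NA, NA, NA),
  child3 = c(NA, NA, "4675", "3735", "3735"),
  stringsAsFactors = FALSE
)

# Let type.convert pick a class per column; as.is = TRUE keeps text
# columns as character instead of turning them into factors.
y[] <- lapply(y, type.convert, as.is = TRUE)

str(y)
```

With this approach child1 should stay character while child2 and child3 become numeric, without having to name any column explicitly.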

Answer 1: (score: 1)

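While searching for the "duplicate subscripts for columns" error mentioned in @RichardScriven's comment, I found a related question: Import Infopath .XML forms into data frame in R. Borrowing from it, I modified my original attempt to arrive at this solution: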

## Convert XML to list
xl <- xmlToList(x2)
#xl[sapply(xl, is.null)] <- NA

## Function that splits the XML path "key" on the last dot, to get the "table" and "column":
splitKey <- function(text) {
  ss <- strsplit(text, "[.]")[[1]]
  lss <- length(ss)
  out <- if (lss == 1) c(NA, text) else c(paste0(ss[-lss], collapse="."), ss[lss])
  return(out)
}

## Put flattened list in a data.table, and add the table/column names for each key:
dt2 <- as.data.table(unlist(xl), keep.rownames=T)
setnames(dt2, c("key", "value"))
dt2[, c("table","column"):=as.list(splitKey(key)), by=key][]
##                              key      value                  table       column
##  1: grandparentInfo.grandparent1        foo        grandparentInfo grandparent1
##  2: grandparentInfo.grandparent1        bar        grandparentInfo grandparent1
##  3:  grandparentInfo..attrs.junk       TRUE grandparentInfo..attrs         junk
##  4:                parent.child1     ABC123                 parent       child1
##  5:                parent.child2 1381956044                 parent       child2
##  6:                parent.child2 1397527137                 parent       child2
##  7:                parent.child3       4675                 parent       child3
##  8:                parent.child1     DEF456                 parent       child1
##  9:                parent.child3       3735                 parent       child3
## 10:                parent.child3       3735                 parent       child3

## Get the number of elements within each "parent" section
##  (this is the part I was missing in my original attempts):
newRows <- as.data.table(sapply(xl, length), keep.rownames=T)[V1=="parent", V2]
newRows
## [1] 2 1 1 2 2

## Subset the "parent" table, and add the correct row numbers:
dt2[table=="parent", row:=rep(seq_along(newRows),times=newRows)]
## Warning message:
## In `[.data.table`(dt2, table == "parent", `:=`(row, rep(seq_along(newRows),  :
##   Supplied 8 items to be assigned to 7 items of column 'row' (1 unused)

## Need to fix this answer to include null elements, since the `unlist` command seems to strip them out...

## Reshape from long to wide:
dcast.data.table(dt2[table=="parent"], row~column, value.var="value", fill=NA)
##    row child1     child2 child3
## 1:   1 ABC123 1381956044     NA
## 2:   2     NA 1397527137     NA
## 3:   3     NA         NA   4675
## 4:   4 DEF456         NA   3735
## 5:   5     NA         NA   3735
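
The warning above appears because the empty <child1/> element becomes NULL in the list, and unlist drops it, so the last parent contributes one value rather than two. One possible way around this, offered only as a sketch, is to count the values that survive unlist within each parent instead of using length. The names parents, flat, long and wide are introduced here for illustration and are not part of the original code.

```r
library(XML)
library(data.table)

x2 <- '<grandparent><grandparentInfo junk="TRUE"><grandparent1>foo</grandparent1><grandparent1>bar</grandparent1></grandparentInfo><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

xl <- xmlToList(x2)

# Keep only the "parent" entries of the top-level list
parents <- xl[names(xl) == "parent"]

# Count the values that survive unlist() in each parent, so that
# NULL (empty) elements are not counted
newRows <- vapply(parents, function(p) length(unlist(p)),
                  integer(1), USE.NAMES = FALSE)

# Flatten, strip the "parent." prefix, and attach the row numbers
flat <- unlist(parents)
long <- data.table(
  column = sub("^parent[.]", "", names(flat)),
  value  = unname(flat),
  row    = rep(seq_along(newRows), times = newRows)
)

# Reshape from long to wide; combinations with no value become NA
wide <- dcast(long, row ~ column, value.var = "value")
wide
```

A limitation of this sketch: a parent whose children are all empty contributes no values at all, so it would not appear as a row in the result.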

It feels like the xmlToDataFrame approach would be better than this, but I need to learn how to better subset the XML in order to use it...
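
One way the subsetting might be done, sketched here under the assumption that only the <parent> nodes are wanted, is to select them with an XPath expression and pass them to xmlToDataFrame through its nodes argument. This sidesteps the grandparentInfo section that triggers the "duplicate subscripts for columns" error. The names doc, parentNodes and y2 are illustrative.

```r
library(XML)

x2 <- '<grandparent><grandparentInfo junk="TRUE"><grandparent1>foo</grandparent1><grandparent1>bar</grandparent1></grandparentInfo><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

# Parse into an internal document so that XPath queries can be used
doc <- xmlParse(x2, asText = TRUE)

# Select every <parent> node, wherever it sits in the tree
parentNodes <- getNodeSet(doc, "//parent")

# Build the data frame from just those nodes
y2 <- xmlToDataFrame(nodes = parentNodes, stringsAsFactors = FALSE)

# Treat empty elements such as <child1/> as missing
y2[y2 == ""] <- NA
y2
```

Because the XPath expression names only the element to keep, the child element names still do not need to be known ahead of time.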