I am trying to parse an XML file. A simplified version of it looks like this:
x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'
library(XML)
xmlRoot(xmlTreeParse(x))
## <grandparent>
## <parent>
## <child1>ABC123</child1>
## <child2>1381956044</child2>
## </parent>
## <parent>
## <child2>1397527137</child2>
## </parent>
## <parent>
## <child3>4675</child3>
## </parent>
## <parent>
## <child1>DEF456</child1>
## <child3>3735</child3>
## </parent>
## <parent>
## <child1/>
## <child3>3735</child3>
## </parent>
## </grandparent>
I would like to convert the XML to a data.frame / data.table that looks like this:
parent <- data.frame(child1=c("ABC123",NA,NA,"DEF456",NA), child2=c(1381956044, 1397527137, rep(NA, 3)), child3=c(rep(NA, 2), 4675, 3735, 3735))
parent
## child1 child2 child3
## 1 ABC123 1381956044 NA
## 2 <NA> 1397527137 NA
## 3 <NA> NA 4675
## 4 DEF456 NA 3735
## 5 <NA> NA 3735
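If every parent node always contained all possible elements ("child1", "child2", "child3", etc.), I could flatten it with xmlToList and unlist, and then use dcast to put it into a table. But the XML often has missing child elements. Here is an attempt that gives incorrect output: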
library(data.table)
## Flatten:
dt <- as.data.table(unlist(xmlToList(x)), keep.rownames=TRUE)
setnames(dt, c("column", "value"))
## Add row numbers, but they're incorrect due to missing XML elements:
dt[, row:=seq_len(.N), by=column][]
## column value row
## 1: parent.child1 ABC123 1
## 2: parent.child2 1381956044 1
## 3: parent.child2 1397527137 2
## 4: parent.child3 4675 1
## 5: parent.child1 DEF456 2
## 6: parent.child3 3735 2
## 7: parent.child3 3735 3
## Reshape from long to wide, but some values are in the wrong row:
dcast.data.table(dt, row~column, value.var="value", fill=NA)
## row parent.child1 parent.child2 parent.child3
## 1: 1 ABC123 1381956044 4675
## 2: 2 DEF456 1397527137 3735
## 3: 3 NA NA 3735
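I won't know the names of the child elements in advance, or the number of unique element names among the grandparent's children, so the answer should be flexible.
The actual XML file has several levels of nesting, and xmlToDataFrame fails on it. Here is an updated (but still simplified) version: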
x2 <- '<grandparent><grandparentInfo junk="TRUE"><grandparent1>foo</grandparent1><grandparent1>bar</grandparent1></grandparentInfo><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'
xmlToDataFrame(x2)
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("foo", :
## duplicate subscripts for columns
Answer 0 (score: 1)
As suggested in the comments, you can simply call xmlToDataFrame on x:
> library(XML)
> y <- xmlToDataFrame(x)
> y[y == ""] <- NA
> y
# child1 child2 child3
# 1 ABC123 1381956044 <NA>
# 2 <NA> 1397527137 <NA>
# 3 <NA> <NA> 4675
# 4 DEF456 <NA> 3735
# 5 <NA> <NA> 3735
For a data.table result:
> library(data.table)
> data.table(y)
# child1 child2 child3
# 1: ABC123 1381956044 NA
# 2: NA 1397527137 NA
# 3: NA NA 4675
# 4: DEF456 NA 3735
# 5: NA NA 3735
You may want to use the colClasses argument to put the columns into the right classes for analysis.
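As a rough alternative to colClasses, the columns can also be converted after the fact. The sketch below rests on assumptions: the data frame y is built by hand to stand in for the all-character result of xmlToDataFrame(x) shown above, and type.convert is base R's general type-guessing helper, not something specific to the XML package.

```r
# Hand-built stand-in (assumption) for the all-character output of
# xmlToDataFrame(x) after empty strings have been replaced by NA.
y <- data.frame(
  child1 = c("ABC123", NA, NA, "DEF456", NA),
  child2 = c("1381956044", "1397527137", NA, NA, NA),
  child3 = c(NA, NA, "4675", "3735", "3735"),
  stringsAsFactors = FALSE
)

# Let base R guess a suitable class for every column.
# as.is = TRUE keeps non-numeric columns as character instead of factor.
y[] <- lapply(y, type.convert, as.is = TRUE)

str(y)
```

This keeps child1 as character and turns child2 and child3 into numeric columns; whether that guess matches the classes you want for a real file is something to check column by column.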
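Answer 1 (score: 1)
While searching for the "duplicate subscripts for columns" error mentioned in @RichardScriven's comment, I found a related question: Import Infopath .XML forms into data frame in R. Borrowing from it, I modified my initial attempt to arrive at this solution: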
## Convert XML to list
xl <- xmlToList(x2)
#xl[sapply(xl, is.null)] <- NA
## Function that splits the XML path "key" on the last dot, to get the "table" and "column":
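splitKey <- function(text) {
## Split on literal dots; the last piece is the "column", the rest is the "table"
ss <- strsplit(text, "[.]")[[1]]
lss <- length(ss)
if (lss == 1) {
out <- c(NA_character_, text)
} else {
out <- c(paste0(ss[-lss], collapse="."), ss[lss])
}
return(out)
}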
## Put flattened list in a data.table, and add the table/column names for each key:
dt2 <- as.data.table(unlist(xl), keep.rownames=TRUE)
setnames(dt2, c("key", "value"))
dt2[, c("table","column"):=as.list(splitKey(key)), by=key][]
## key value table column
## 1: grandparentInfo.grandparent1 foo grandparentInfo grandparent1
## 2: grandparentInfo.grandparent1 bar grandparentInfo grandparent1
## 3: grandparentInfo..attrs.junk TRUE grandparentInfo..attrs junk
## 4: parent.child1 ABC123 parent child1
## 5: parent.child2 1381956044 parent child2
## 6: parent.child2 1397527137 parent child2
## 7: parent.child3 4675 parent child3
## 8: parent.child1 DEF456 parent child1
## 9: parent.child3 3735 parent child3
## 10: parent.child3 3735 parent child3
## Get the number of elements within each "parent" section
## (this is the part I was missing in my original attempts):
newRows <- as.data.table(sapply(xl, length), keep.rownames=TRUE)[V1=="parent", V2]
newRows
## [1] 2 1 1 2 2
## Subset the "parent" table, and add the correct row numbers:
dt2[table=="parent", row:=rep(seq_along(newRows),times=newRows)]
## Warning message:
## In `[.data.table`(dt2, table == "parent", `:=`(row, rep(seq_along(newRows), :
## Supplied 8 items to be assigned to 7 items of column 'row' (1 unused)
## Need to fix this answer to include null elements, since the `unlist` command seems to strip them out...
## Reshape from long to wide:
dcast.data.table(dt2[table=="parent"], row~column, value.var="value", fill=NA)
## row child1 child2 child3
## 1: 1 ABC123 1381956044 NA
## 2: 2 NA 1397527137 NA
## 3: 3 NA NA 4675
## 4: 4 DEF456 NA 3735
## 5: 5 NA NA 3735
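The warning above comes from counting list elements that unlist later drops. One possible way around it, sketched here in base R only, is to count per parent exactly what unlist keeps, so the row numbers always line up with the flattened values. The list xl below is a hand-built stand-in (an assumption) for what xmlToList(x2) returns, with the empty child1 element represented as NULL; the names parents, n_kept, row_id and wide are illustrative.

```r
# Hand-built stand-in (assumption) for xmlToList(x2); <child1/> becomes NULL.
xl <- list(
  grandparentInfo = list(grandparent1 = "foo", grandparent1 = "bar",
                         .attrs = c(junk = "TRUE")),
  parent = list(child1 = "ABC123", child2 = "1381956044"),
  parent = list(child2 = "1397527137"),
  parent = list(child3 = "4675"),
  parent = list(child1 = "DEF456", child3 = "3735"),
  parent = list(child1 = NULL, child3 = "3735")
)

# Keep only the "parent" sections.
parents <- xl[names(xl) == "parent"]

# Count the values that survive unlist() in each parent (NULLs are dropped),
# instead of counting list elements with length().
n_kept <- vapply(parents, function(p) length(unlist(p)), integer(1))

# Flatten, and build one row number per surviving value.
flat <- unlist(parents)
row_id <- rep(seq_along(n_kept), times = n_kept)
stopifnot(length(row_id) == length(flat))

# Strip the "parent." prefix to get column names, then fill a wide table.
column <- sub("^parent\\.", "", names(flat))
cols <- unique(column)
wide <- matrix(NA_character_, nrow = length(parents), ncol = length(cols),
               dimnames = list(NULL, cols))
wide[cbind(row_id, match(column, cols))] <- flat
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
wide
```

The same row_id vector could be assigned to dt2[table=="parent"] in place of the rep() call above. Note that this sketch assumes each child name occurs at most once per parent; repeated names inside one parent would overwrite each other in the wide table.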
It feels like the xmlToDataFrame approach would be better than this, but I need to figure out how to subset the XML so that I can use it...
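One way this subsetting might be done is to select only the parent nodes with an XPath expression and hand them to xmlToDataFrame through its nodes argument. This is a sketch rather than a tested fix for the real file: the XPath "/grandparent/parent" is an assumption that matches the simplified x2 above, and the variable names are illustrative.

```r
library(XML)

x2 <- '<grandparent><grandparentInfo junk="TRUE"><grandparent1>foo</grandparent1><grandparent1>bar</grandparent1></grandparentInfo><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

# Parse into an internal document so that XPath queries can be used.
doc <- xmlParse(x2, asText = TRUE)

# Select only the <parent> nodes, skipping <grandparentInfo>.
parent_nodes <- getNodeSet(doc, "/grandparent/parent")

# Build the data frame from just those nodes.
parents <- xmlToDataFrame(nodes = parent_nodes, stringsAsFactors = FALSE)

# Treat empty strings (missing or empty elements) as NA.
parents[] <- lapply(parents, function(col) replace(col, which(col == ""), NA))

parents
```

Because grandparentInfo never reaches xmlToDataFrame, its repeated grandparent1 elements should no longer be able to trigger the "duplicate subscripts for columns" error. For deeper nesting, the XPath would need to be adapted to wherever the record-like nodes sit.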