I have a list in the following format:
[[1]]
[[1]]$a
[1] 1
[[1]]$b
[1] 3
[[1]]$c
[1] 5
[[2]]
[[2]]$c
[1] 2
[[2]]$a
[1] 3
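There is a predefined list of possible "keys" (in this case a, b, and c), and each element ("row") of the list has values defined for one or more of these keys. I'm looking for a fast way to get from the list structure above to a data frame, which in this case would look like this: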
a b c
1 1 3 5
2 3 NA 2
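For reference, here is a minimal sketch of how the list printed above could be built. The name `x` is an assumption, chosen only for illustration.

```r
# Hypothetical name `x`: the two-"row" list shown above
x <- list(
  list(a = 1, b = 3, c = 5),
  list(c = 2, a = 3)   # keys may be missing or out of order
)
str(x)
```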
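Any help would be greatly appreciated!
Addendum
I'm working with a table of up to 50,000 rows and 3-6 columns, with most of the values specified. I'll be getting the table from JSON and want to convert it quickly into a data.frame structure.
Here is some code that creates a sample list at the scale I'll be working with: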
ids <- c("a", "b", "c")
createList <- function(approxSize=100){
  set.seed(1234)
  fifth <- round(approxSize/5)
  list <- list()
  list[1:(fifth*5)] <- rep(
    list(list(a=1, b=2, c=3),
         list(a=3, b=4, c=5),
         list(a=7, c=9),
         list(c=6, a=8, b=3),
         list(b=6)),
    fifth)
  list
}
Just create a list with an approxSize of 50,000 to test performance on a list of this size.
Answer 0 (score: 9)
Here's a short answer, though I doubt it will be very fast.
> library(plyr)
> rbind.fill(lapply(x, as.data.frame))
a b c
1 1 3 5
2 3 NA 2
Answer 1 (score: 9)
Here's my initial thought. It doesn't speed up your approach, but it does simplify the code considerably:
# makeDF <- function(List, Names) {
#     m <- t(sapply(List, function(X) unlist(X)[Names]))
#     as.data.frame(m)
# }
## vapply() is a bit faster than sapply()
makeDF <- function(List, Names) {
    # Look up each row's values by name; missing names come back as NA
    m <- t(vapply(List,
                  FUN = function(X) unlist(X)[Names],
                  FUN.VALUE = numeric(length(Names))))
    as.data.frame(m)
}
## Test timing with a 50k-item list
ll <- createList(50000)
nms <- c("a", "b", "c")
system.time(makeDF(ll, nms))
# user system elapsed
# 0.47 0.00 0.47
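The key step in makeDF() is indexing the unlisted row by name: names that are absent come back as NA, and the result always follows the order of Names. Below is a small sketch on the second row from the question. The variable names are illustrative, and it assumes every value is a length-one numeric.

```r
row <- list(c = 2, a = 3)   # a row with `b` missing and keys out of order
nms <- c("a", "b", "c")
v <- unlist(row)[nms]       # name-based lookup; missing names yield NA
unname(v)
# 3 NA 2
```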
Answer 2 (score: 3)
If you know the possible values beforehand and you are dealing with large data, then using data.table and set may be fast:
library(data.table)
cc <- createList(50000)
system.time({
  # preallocate one all-NA numeric column per id
  nas <- rep.int(NA_real_, length(cc))
  DT <- setnames(as.data.table(replicate(length(ids), nas, simplify = FALSE)), ids)
  # fill in the defined values by reference, row by row
  for(xx in seq_along(cc)){
    .n <- names(cc[[xx]])
    for(j in .n){
      set(DT, i = xx, j = j, value = cc[[xx]][[j]])
    }
  }
})
# user system elapsed
# 0.68 0.01 0.70
full <- c('a','b', 'c')
system.time({
  for(xx in seq_along(cc)) {
    mm <- setdiff(full, names(cc[[xx]]))
    # convert rows that are missing columns or whose columns are out of order
    if(length(mm) || !all(names(cc[[xx]]) == full)){
      cc[[xx]] <- as.data.table(cc[[xx]])
      # any missing columns
      if(length(mm)){
        # if required add additional columns
        cc[[xx]][, (mm) := as.list(rep(NA_real_, length(mm)))]
      }
      # put columns in correct order
      setcolorder(cc[[xx]], full)
    }
  }
  cdt <- rbindlist(cc)
})
# user system elapsed
# 21.83 0.06 22.00
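The second solution has been left here to illustrate how not to use data.table.
Answer 3 (score: 2)
OK, here's my first attempt. The performance isn't as bad as I feared, but I'm sure there is still room for improvement (especially in the wasteful matrix -> data.frame conversion).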
convertList <- function(myList, ids){
  # Compute, for each row, the column index of each value, to handle missing or
  # out-of-order list elements. Each element associated with a gets 1, b -> 2,
  # and c -> 3, so a row containing a=_, c=_, b=_ would have a value of `1,3,2`
  idInd <- lapply(myList, function(x){match(names(x), ids)})
  # Calculate the row indices as if myList were unlisted. So if there were two elements
  # in the first row, 3 in the second, and 1 in the third, you'd see: 1, 1, 2, 2, 2, 3
  rowInd <- inverse.rle(list(values=seq_along(myList), lengths=sapply(myList, length)))
  # Unlist the first list created so it is just a numeric vector
  idInd <- unlist(idInd)
  # Create a grid of addresses. The first column is the row address, the second is the col
  address <- cbind(rowInd, idInd)
  # Have to use a matrix because you can't assign into a data.frame
  # using an addressing table like the one above
  mat <- matrix(ncol=length(ids), nrow=length(myList))
  # Assign the values to the addresses in the matrix
  mat[address] <- unlist(myList)
  # Convert to data.frame
  df <- as.data.frame(mat)
  colnames(df) <- ids
  df
}
myList <- createList(50000)
ids <- letters[1:3]
system.time(df <- convertList(myList, ids))
Converting 50,000 rows takes about 0.29 seconds on my laptop (Windows 7, Intel i7 M620 @ 2.67 GHz, 4 GB RAM).
Still very interested in other answers!
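The core of this approach is assigning into a matrix through a two-column (row, column) index matrix. Here is a minimal sketch on the two-row example from the question. It uses `rep()` with `lengths()` in place of `inverse.rle()`, which is a substitution made here for brevity, not part of the answer's code.

```r
ids <- c("a", "b", "c")
myList <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))

colInd <- unlist(lapply(myList, function(x) match(names(x), ids)))  # 1 2 3 3 1
rowInd <- rep(seq_along(myList), lengths(myList))                   # 1 1 1 2 2

# Start from an all-NA numeric matrix so that unset cells stay NA
mat <- matrix(NA_real_, nrow = length(myList), ncol = length(ids),
              dimnames = list(NULL, ids))
mat[cbind(rowInd, colInd)] <- unlist(myList)   # one (row, col) pair per value
mat
#      a  b c
# [1,] 1  3 5
# [2,] 3 NA 2
```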
Answer 4 (score: 2)
I know this is an old question, but I just came across it and it was painful not to see the simplest solution I know of. So here it is (simply specify 'fill=TRUE' in rbindlist):
library(data.table)
list = list(list(a=1,b=3,c=5),list(c=2,a=3))
rbindlist(list,fill=TRUE)
# a b c
# 1: 1 3 5
# 2: 3 NA 2
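I don't know whether this is the fastest way, but I'd be willing to bet it is competitive, since data.table is thoughtfully designed and performs extremely well on many other tasks.
Answer 5 (score: 0)
In dplyr: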
library(dplyr)
bind_rows(lapply(x, as_tibble))  # as_tibble() replaces the deprecated as_data_frame()
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 1 3 5
2 3 NA 2