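I have two lists, each containing several thousand data tables. The data tables contain ids, and each id appears only once within each list. In addition, each data table has different columns, although it shares some column names with some of the other data tables. For example, in the lists I create below, id 1 appears in the first data table in list1 and in the second data table in list2. In the first list, id 1 has data for columns 'a' and 'd'; in the second list, it has data for columns 'a' and 'b'.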
library(data.table)
# Create 2 lists of data frames
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
data.table(id=c(5,6), a=c(4,0), b=c(2,1)))
list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))
What I need to do is find each id in both lists and average its results. For id 1, for example, the two rows are:
list id a b d
list1 1 0 NA 1
list2 1 1 2 NA
NA values are treated as 0, so the result for id 1 should be:
id a b d
1 0.5 1 0.5
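The averaging step for a single id can be sketched as follows. This is only an illustration of the rule described above (stack the two rows, treat NA as 0, take column means); the names r1, r2, both and avg are made up for this example.

```r
library(data.table)

# The two rows for id 1, copied from the example lists (illustrative names)
r1 <- data.table(id = 1, a = 0, d = 1)   # row found in list1
r2 <- data.table(id = 1, a = 1, b = 2)   # row found in list2

# Stack the rows; a column missing from one row is filled with NA
both <- rbind(r1, r2, fill = TRUE)

# Treat NA as 0, column by column
for (col in names(both)) {
  set(both, which(is.na(both[[col]])), col, 0)
}

# Average every column except id
value_cols <- setdiff(names(both), "id")
avg <- colMeans(both[, value_cols, with = FALSE])
avg
```

Here avg is a named numeric vector with one entry per column seen in either row.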
Next, select the top 3 column names and order them by their values, so that the result is:
id top3
1 b d a
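As a minimal sketch of the selection step, assuming the averages for one id are held in a named numeric vector (avg is an illustrative name), the top column names can be read off a decreasing sort. Columns with equal averages, such as 'a' and 'd' here, may come out in either order.

```r
# Averages for id 1, as a named numeric vector (illustrative name)
avg <- c(a = 0.5, b = 1, d = 0.5)

# Sort by value, largest first, and keep at most three names
n_keep <- min(3, length(avg))
top3 <- names(sort(avg, decreasing = TRUE))[seq_len(n_keep)]

# Collapse into a single space-separated string
top3_string <- paste(top3, collapse = " ")
top3_string
```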
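This needs to be repeated for every id. I have code that does this (below), but for large lists with thousands of data tables and more than a million ids, it is very slow.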
top3 <- data.table() # Accumulates one result row per id
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }
  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }
  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)}
  means <- colMeans(bind[,2:ncol(bind),with=F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))
  # select and order the top 3 ids and bind to a data frame
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}
top3
id top3.V1
1: 1 b d a
2: 2 f c d
3: 3 d e c
4: 4 f c b
5: 5 a b
6: 6 e c b
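When I run this code on my full data set (which has a few million ids), it gets through only about 400 ids in about 60 seconds. Completing the entire data set would take days. Converting each list into one larger data table is not an option; there are 100,000 possible columns, so it becomes too large. Is there a faster way to achieve the desired result?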
Answer 0 (score: 4)
Melt the individual data.tables, and you won't run into the wasted-memory problem:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
# find number of "rows" per id
, nvals := max(rle(sort(variable))$lengths), by = id][
# compute the means, assuming that missing values are equal to 0
, sum(value)/nvals[1], by = .(id, variable)][
# extract top 3 values
order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
# id V1
#1: 1 b a d
#2: 2 f c b
#3: 3 d e a
#4: 4 b c f
#5: 5 a b
#6: 6 e b c
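To see why melting sidesteps the wide-table problem, it may help to look at what melt returns for a single table. The sketch below uses the first table of list1; dt and long are illustrative names. The long format has one row per (id, column) pair that actually exists, so columns a table does not have never take up space.

```r
library(data.table)

# First table of list1 from the question
dt <- data.table(id = c(1, 3), a = c(0, 0), d = c(1, 1))

# Long format: one row per (id, column) pair
long <- melt(dt, id.vars = "id", variable.factor = FALSE)
long
```

The result has the columns id, variable and value, which is the shape rbindlist stacks across all tables before grouping.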
Or, instead of using rle, you can do:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
, .(vals = sum(value), nvals = .N), by = .(id, variable)][
, vals := vals / max(nvals), by = id][
order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]
Or better yet, as Frank pointed out, don't even bother with the mean:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
, sum(value), by = .(id, variable)][
order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
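Skipping the mean works because, within one id, every column total is divided by the same number of rows, so ranking by sum gives the same order as ranking by mean. The sketch below wraps the sum-based approach in a helper and runs it on a tiny input without ties. The helper name top_n_by_id, its n argument, and the list_a/list_b data are assumptions made for this illustration, not part of the answer.

```r
library(data.table)

# Hypothetical helper: top-n column names per id, ranked by column total
top_n_by_id <- function(tables, n = 3) {
  # Melt each table to long format and stack the results
  long <- rbindlist(lapply(tables, melt, id.vars = "id",
                           variable.factor = FALSE))
  # Total per (id, column); columns absent for an id never appear
  totals <- long[, .(total = sum(value)), by = .(id, variable)]
  # Largest totals first, then keep the first n names within each id
  totals[order(-total),
         .(top = paste(head(variable, n), collapse = " ")),
         keyby = id]
}

# Tiny made-up input with no ties
list_a <- list(data.table(id = c(1, 2), a = c(5, 0), b = c(1, 2)))
list_b <- list(data.table(id = c(1, 2), b = c(3, 9), c = c(10, 1)))

res <- top_n_by_id(c(list_a, list_b))
res
```

For id 1 the totals are a = 5, b = 4 and c = 10, and for id 2 they are a = 0, b = 11 and c = 1, which determines the order in each row of res.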
Answer 1 (score: 0)
Not sure about performance, but this should avoid the for loop:
library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)
b <- ldply(list2, data.frame)
dat <- full_join(a,b)
This gives you a single data frame:
id a d b c f e
1 1 0 1 NA NA NA NA
2 3 0 1 NA NA NA NA
3 2 NA NA 1 2 3 NA
4 4 NA NA 0 1 1 NA
5 5 4 NA 2 NA NA NA
6 6 0 NA 1 NA NA NA
7 2 NA 1 NA 0 NA 0
8 3 NA 1 NA 0 NA 1
9 6 NA 0 NA 1 NA 2
10 1 1 NA 2 NA NA NA
11 4 0 NA 1 NA NA NA
12 5 3 NA 2 NA NA NA
Summarise by id:
means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))
id a d b c f e
1 1 0.5 1 2.0 NA NA NA
2 2 NaN 1 1.0 1 3 0
3 3 0.0 1 NaN 0 NaN 1
4 4 0.0 NaN 0.5 1 1 NaN
5 5 3.5 NaN 2.0 NaN NaN NaN
6 6 0.0 0 1.0 1 NaN 2
Listing the top 3 via sapply gives you the same result table (but as a matrix, with each column corresponding to an id):
sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b" "f" "d" "c" "a" "e"
[2,] "d" "d" "e" "f" "b" "b"
[3,] "a" "b" "a" "b" NA "c"
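Note that mean(x, na.rm=T) drops missing values, whereas the question asks for NA to count as 0, and the two can rank columns differently. A minimal base R sketch of the zero-fill variant, using a small stand-in for dat with only the two rows for id 1 (the stand-in and the name avg are assumptions for this example):

```r
# Stand-in for `dat`: the two rows for id 1 after joining
dat <- data.frame(id = c(1, 1),
                  a  = c(0, 1),
                  d  = c(1, NA),
                  b  = c(NA, 2))

# Treat missing values as 0 instead of dropping them
dat[is.na(dat)] <- 0

# Average every column within each id
avg <- aggregate(. ~ id, data = dat, FUN = mean)
avg
```

With zero-fill, d averages to 0.5 and b to 1 for id 1, matching the values in the question, whereas dropping NAs gives 1 and 2.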
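**Update**
Since the data is large, it is prudent to create functions that select and combine the appropriate data.frames for each id.
(i) Find all the ids in the two lists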
id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")
(ii) Find which table in each list contains each of the ids 1 to 6
id_l1<-lapply(1:6, function(x) sapply(id_list1, function(y) any(y==x) %>% unlist))
id_l2<-lapply(1:6, function(x) sapply(id_list2, function(y) any(y==x) %>% unlist))
(iii) Create a function to combine the appropriate data frames for a particular id
id_who<-function(x){
a <- data.frame(list1[id_l1[[x]]])
a <- a[a$id==x, ]
b <- data.frame(list2[id_l2[[x]]])
b <- b[b$id==x, ]
full_join(a,b)
}
new <- lapply(1:6, id_who)
new
[[1]]
id a d b
1 1 0 1 NA
2 1 1 NA 2
[[2]]
id b c f d e
1 2 1 2 3 NA NA
2 2 NA 0 NA 1 0
[[3]]
id a d c e
1 3 0 1 0 1
[[4]]
id b c f a
1 4 0 1 1 NA
2 4 1 NA NA 0
[[5]]
id a b
1 5 4 2
2 5 3 2
[[6]]
id a b c d e
1 6 0 1 1 0 2
output<-ldply(new, summarise_each, funs(means))
The output is the same as above.
The advantage of this approach is that you can easily insert logical breaks into the process at (ii) or (iii).
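With millions of ids, scanning every table for each id is likely to dominate the run time. One possible alternative, sketched here with data.table rather than the plyr/dplyr calls above, is to build the id-to-table index once and look ids up by key. The names index1, tbl, row and hit are illustrative, and only the first two tables of list1 are used to keep the example short.

```r
library(data.table)

# First two tables of list1 from the question
list1 <- list(data.table(id = c(1, 3), a = c(0, 0), d = c(1, 1)),
              data.table(id = c(2, 4), b = c(1, 0), c = c(2, 1), f = c(3, 1)))

# One index row per id: which table holds it, and at which row
index1 <- rbindlist(lapply(seq_along(list1), function(k) {
  data.table(id  = list1[[k]]$id,
             tbl = k,
             row = seq_len(nrow(list1[[k]])))
}))
setkey(index1, id)   # keyed lookups use binary search

# Look up id 4 and pull its row from the right table
hit <- index1[.(4)]
list1[[hit$tbl]][hit$row]
```

The index is built in a single pass over the tables, and the same construction would apply to list2.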