Combining common ids across 2 lists of data tables

Time: 2016-05-23 06:18:25

Tags: r data.table

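I have two lists that each contain several thousand data tables. The data tables contain ids, and each id will only appear once within each list. Additionally, each data table will have different columns, though they will share column names with some of the other data tables. For example, in the lists created below, id 1 appears in the first data table in list1 and in the second data table in list2. In the first list, id 1 has data for columns 'a' and 'd', while in the second list it has data for columns 'a' and 'b'.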

library(data.table)
# Create 2 lists of data frames
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))

list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))

What I need to do is find each id in both lists and average its results.

 list id  a  b  d
list1  1  0 NA  1
list2  1  1  2 NA

NA values are treated as 0, so the result for id 1 should be:

id   a b   d
 1 0.5 1 0.5

Next, select the top 3 column names and order them by their values, giving:

id    top3
 1   b d a

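This needs to be repeated for all ids. I have code that does this (below), but for large lists with thousands of data tables and over a million ids it is very slow.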

top3 <- data.table() # accumulator for the results; must exist before the first rbind() below
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }

  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }

  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)}
  means <- colMeans(bind[,2:ncol(bind),with=F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))

  # select and order the top 3 ids and bind to a data frame
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}

   id top3.V1
1:  1   b d a
2:  2   f c d
3:  3   d e c
4:  4   f c b
5:  5     a b
6:  6   e c b

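When I run this code on my full dataset (which has a few million ids), it gets through only about 400 ids in about 60 seconds. Processing the entire dataset would take days. Converting each list into 1 much larger data table is not an option; there are 100,000 possible columns, so it becomes too large. Is there a faster way to achieve the desired result?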

2 Answers:

Answer 0: (score: 4)

Melt the individual data.tables and you won't run into the problem of wasting memory:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
  # find number of "rows" per id
  , nvals := max(rle(sort(variable))$lengths), by = id][
  # compute the means, assuming that missing values are equal to 0
  , sum(value)/nvals[1], by = .(id, variable)][
  # extract top 3 values
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
#   id    V1
#1:  1 b a d
#2:  2 f c b
#3:  3 d e a
#4:  4 b c f
#5:  5   a b
#6:  6 e b c

Or, instead of rle, you can do:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , .(vals = sum(value), nvals = .N), by = .(id, variable)][
  , vals := vals / max(nvals), by = id][
  order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]

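Or better yet, as Frank points out, don't even bother with the mean. The divisor is the same for every column of a given id, so ranking by the sum gives the same order as ranking by the mean: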

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
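
To make that reasoning easy to check, here is a minimal sketch on the sample lists from the question. The helper name to_long and the column names total and avg are made up for illustration, and the divisor 2 assumes every id occurs in exactly two lists, as it does in the sample data.

```r
library(data.table)

# Sample lists copied from the question
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))
list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))

# Hypothetical helper: melt every table to long form (id, variable, value)
to_long <- function(tables) {
  rbindlist(lapply(tables, melt, id.vars = "id", variable.factor = FALSE))
}
long <- to_long(c(list1, list2))

# Sum per (id, variable); a column missing from one list adds nothing, i.e. counts as 0
sums <- long[, .(total = sum(value)), by = .(id, variable)]

# Assumption: each id occurs in exactly 2 lists, so the mean is total / 2
sums[, avg := total / 2]

# Ranking by total and ranking by avg pick the same columns in the same order
top_by_sum <- sums[order(-total), .(top3 = paste(head(variable, 3), collapse = " ")), keyby = id]
top_by_avg <- sums[order(-avg), .(top3 = paste(head(variable, 3), collapse = " ")), keyby = id]
print(top_by_sum)
```

Only the long table is ever materialised, so the 100,000 possible columns never have to exist side by side in one wide table.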

Answer 1: (score: 0)

Not sure about performance, but this should avoid the for loop:

library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)
b <- ldply(list2, data.frame)
dat <- full_join(a,b)

This will give you a single data frame:

   id  a  d  b  c  f  e
1   1  0  1 NA NA NA NA
2   3  0  1 NA NA NA NA
3   2 NA NA  1  2  3 NA
4   4 NA NA  0  1  1 NA
5   5  4 NA  2 NA NA NA
6   6  0 NA  1 NA NA NA
7   2 NA  1 NA  0 NA  0
8   3 NA  1 NA  0 NA  1
9   6 NA  0 NA  1 NA  2
10  1  1 NA  2 NA NA NA
11  4  0 NA  1 NA NA NA
12  5  3 NA  2 NA NA NA

Summarise by id:

means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))

     id     a     d     b     c     f     e
1     1   0.5     1   2.0    NA    NA    NA
2     2   NaN     1   1.0     1     3     0
3     3   0.0     1   NaN     0   NaN     1
4     4   0.0   NaN   0.5     1     1   NaN
5     5   3.5   NaN   2.0   NaN   NaN   NaN
6     6   0.0     0   1.0     1   NaN     2

Listing the top 3 with sapply gives you the same result table (but as a matrix, with each column corresponding to an id):

 sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
    [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b"  "f"  "d"  "c"  "a"  "e" 
[2,] "d"  "d"  "e"  "f"  "b"  "b" 
[3,] "a"  "b"  "a"  "b"  NA   "c" 
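
One caveat: mean(x, na.rm=T) skips missing cells, whereas the question asks for NA to count as 0, so the averages can differ. The sketch below compares the two for id 1 in base R, with the values copied from the table above. The variable names id1, skip_na and as_zero are made up for illustration.

```r
# Rows for id 1 in the wide layout of `dat` (values copied from the table above)
id1 <- data.frame(a = c(0, 1), d = c(1, NA), b = c(NA, 2))

# Averaging with na.rm = TRUE skips the missing cells
skip_na <- colMeans(id1, na.rm = TRUE)

# Treating the missing cells as 0, as the question specifies
zero_fill <- id1
zero_fill[is.na(zero_fill)] <- 0
as_zero <- colMeans(zero_fill)

# Compare the two sets of averages side by side
print(rbind(skip_na, as_zero))
```

For id 1 the two versions agree on 'a' but not on 'd' and 'b', so on other data the top 3 could come out differently.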

**Update**

Since the data is large, it would be prudent to create some functions that select and combine the appropriate data.frames for each id.

(i) Find all the ids in both lists

id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")

(ii) Find which table in each list contains each of the ids 1 to 6

id_l1<-lapply(1:6, function(x) sapply(id_list1, function(y) any(y==x) %>% unlist)) 
id_l2<-lapply(1:6, function(x) sapply(id_list2, function(y) any(y==x) %>% unlist)) 

(iii) Create a function to combine the appropriate data frames for a specific id

id_who<-function(x){
  a <- data.frame(list1[id_l1[[x]]])
  a <- a[a$id==x, ]
  b <- data.frame(list2[id_l2[[x]]]) 
  b <- b[b$id==x, ]
  full_join(a,b)
} 

lapply(1:6, id_who)
[[1]]
  id a  d  b
1  1 0  1 NA
2  1 1 NA  2

[[2]]
  id  b c  f  d  e
1  2  1 2  3 NA NA
2  2 NA 0 NA  1  0

[[3]]
  id a d c e
1  3 0 1 0 1

[[4]]
  id b  c  f  a
1  4 0  1  1 NA
2  4 1 NA NA  0

[[5]]
  id a b
1  5 4 2
2  5 3 2

[[6]]
  id a b c d e
1  6 0 1 1 0 2

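new <- lapply(1:6, id_who) # `new` is assumed to be the list printed above; it was never assigned
output <- ldply(new, summarise_each, funs(means))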

The output is the same as above.

The advantage of this process is that you can easily add logical breaks to the process at (ii) or (iii).
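
Step (ii) scans every table once per id, which may become the bottleneck with millions of ids. A possible alternative, sketched here with data.table rather than plyr/dplyr, is to build one keyed index per list that records which table and row holds each id, and then look ids up in that index. The names build_index, rows_for_id, tbl and row are hypothetical, and the sketch assumes every id is present in both lists.

```r
library(data.table)

# Sample lists copied from the question
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))
list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))

# Hypothetical helper: one row per id, recording its table index and row number
build_index <- function(tables) {
  rbindlist(lapply(seq_along(tables), function(k) {
    data.table(id = tables[[k]]$id, tbl = k, row = seq_len(nrow(tables[[k]])))
  }))
}
idx1 <- build_index(list1); setkey(idx1, id)
idx2 <- build_index(list2); setkey(idx2, id)

# Hypothetical stand-in for id_who(): fetch both rows for one id via keyed lookups
# Assumes x is present in both lists
rows_for_id <- function(x) {
  p1 <- idx1[.(x)]
  p2 <- idx2[.(x)]
  rbind(list1[[p1$tbl]][p1$row], list2[[p2$tbl]][p2$row], fill = TRUE)
}

print(rows_for_id(1))
```

The index is built in a single pass over each list, so each later lookup is a keyed binary search instead of a scan over all tables.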