Column-binding a large list of data frames into a single data frame

Time: 2016-01-06 13:49:34

Tags: r list merge dplyr

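I need to merge a large list (approx. 15 data frames, each 16000x6). Each data frame has two ID columns, "A" and "B", plus 4 columns containing information.

I would like a single data frame with the first two columns ("A" and "B") plus the 15 * 4 information columns.

I found this in another question: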

Reduce(function(x,y) merge(x,y,by="your tag here"),your_list_here)

However, this crashes my machine because it needs too much RAM (even with a list of only 3 dfs!):

 In make.unique(as.character(rows)) :
  Reached total allocation of 4060Mb: see help(memory.size)

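I think there must be a better strategy. I started with bind_cols from the dplyr package, which produces the combined data frame very quickly, but with duplicated A and B columns. Perhaps removing those columns and keeping only the first two is a better approach.

Here is a small toy list (the Reduce(...) strategy works on it, but I need a different solution):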

dput(mylist)
structure(list(df1 = structure(list(A = c(1, 1, 2, 2, 3, 3), 
    B = c("Q", "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646, 
    0.0418491987511516, 0.798411589581519, 0.898478724062443, 
    0.064307059859857, 0.174364002654329), x2 = c(0.676136856665835, 
    0.494200984947383, 0.534940708894283, 0.220597118837759, 
    0.480761741055176, 0.0230771545320749)), .Names = c("A", 
"B", "x1", "x2"), row.names = c(NA, -6L), class = "data.frame"), 
    df2 = structure(list(A = c(1, 1, 2, 2, 3, 3), B = c("Q", 
    "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646, 0.0418491987511516, 
    0.798411589581519, 0.898478724062443, 0.064307059859857, 
    0.174364002654329), x2 = c(0.676136856665835, 0.494200984947383, 
    0.534940708894283, 0.220597118837759, 0.480761741055176, 
    0.0230771545320749)), .Names = c("A", "B", "x1", "x2"), row.names = c(NA, 
    -6L), class = "data.frame"), df3 = structure(list(A = c(1, 
    1, 2, 2, 3, 3), B = c("Q", "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646, 
    0.0418491987511516, 0.798411589581519, 0.898478724062443, 
    0.064307059859857, 0.174364002654329), x2 = c(0.676136856665835, 
    0.494200984947383, 0.534940708894283, 0.220597118837759, 
    0.480761741055176, 0.0230771545320749)), .Names = c("A", 
    "B", "x1", "x2"), row.names = c(NA, -6L), class = "data.frame")), .Names = c("df1", 
"df2", "df3"))

2 Answers:

Answer 0 (score: 3)

To cbind the data frames, you can do:

L <- mylist[[1]]
for (i in 2:length(mylist)) L <- cbind(L,  mylist[[i]][-(1:2)])

To merge them (as in the previously shown, but incorrect, example of the expected output):

L <- mylist[[1]]
for (i in 2:length(mylist)) L <- merge(L,  mylist[[i]], by=c("A", "B"))

In the merge case, I think the memory demand comes from the m:n joins between the data frames. A different merge routine cannot fix that.
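To make the m:n point concrete, here is a minimal sketch on made-up data (the helper make_df and the list toy are hypothetical, not part of the question). It shows that when an (A, B) combination occurs more than once, every merge pairs each matching row on the left with each matching row on the right, so the row count grows multiplicatively.

```r
# Hypothetical toy data: the key (A, B) is NOT unique.
# (1, "Q") and (3, "P") each occur twice; (2, "Q") and (2, "P") occur once.
make_df <- function() {
  data.frame(A = c(1, 1, 2, 2, 3, 3),
             B = c("Q", "Q", "Q", "P", "P", "P"),
             x1 = (1:6) / 10,
             stringsAsFactors = FALSE)
}
toy <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# How often each (A, B) combination occurs in one data frame.
key_counts <- table(paste(toy$df1$A, toy$df1$B))

# Successive merges on the non-unique key.
merged2 <- merge(toy$df1, toy$df2, by = c("A", "B"))
merged3 <- merge(merged2, toy$df3, by = c("A", "B"))

nrow(toy$df1)   # 6 rows in each input
nrow(merged2)   # equals sum(key_counts^2)
nrow(merged3)   # equals sum(key_counts^3)
```

With 15 data frames, a key that occurs k times in each of them would contribute on the order of k^15 rows, which would be consistent with the allocation failure reported in the question.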

Answer 1 (score: 2)

Based on the comment clarifying that you need a 16,000 x 62 data.frame...

First, cbind the non-ID columns:

tmp <- do.call(cbind, lapply(mylist, function(x) x[,-(1:2)]))

Then add "A" and "B":

final <- cbind(mylist[[1]][,1:2], tmp)

There is no need to merge; just slap the data.frames together.
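Binding by position only gives the right result if every data frame lists the same (A, B) rows in the same order, which the toy data suggests but the thread does not state outright. The sketch below wraps the two steps in a hypothetical helper, cbind_by_position (not from the original answer), that checks this assumption before binding. The small list toy is invented for illustration.

```r
# Hypothetical helper: column-bind a list of data frames that share ID columns,
# after verifying that the ID columns match the first data frame row by row.
cbind_by_position <- function(dfs, id_cols = c("A", "B")) {
  ids <- dfs[[1]][, id_cols, drop = FALSE]
  for (i in seq_along(dfs)) {
    for (col in id_cols) {
      if (!isTRUE(all.equal(dfs[[i]][[col]], ids[[col]]))) {
        stop("ID column '", col, "' of element ", i,
             " differs from the first data frame")
      }
    }
  }
  # Keep only the non-ID columns of each element, then bind everything once.
  values <- lapply(dfs, function(x) x[, !(names(x) %in% id_cols), drop = FALSE])
  cbind(ids, do.call(cbind, values))
}

# Invented toy list with the same shape as the question's example.
make_df <- function(offset) {
  data.frame(A = c(1, 1, 2, 2, 3, 3),
             B = c("Q", "Q", "Q", "P", "P", "P"),
             x1 = (1:6) / 10 + offset,
             x2 = (6:1) / 10 + offset,
             stringsAsFactors = FALSE)
}
toy <- list(df1 = make_df(0), df2 = make_df(1), df3 = make_df(2))

final <- cbind_by_position(toy)
dim(final)
names(final)
```

Because the list is named and each element contributes more than one value column, base R prefixes the value columns with the list names (df1.x1, df1.x2, and so on), so the origin of each column stays visible. If the check fails, the rows would need to be sorted or otherwise aligned first; that step is outside this sketch.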