New data frame of means across data frames

Time: 2015-04-27 20:23:28

Tags: r

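I need to combine five data frames of about 60 columns each. They all have the same columns, and I am combining them by taking their means, because they represent the same values. The problem is not how to combine them, but how to do it efficiently. Here is some sample data/code: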

#reproducible random data
set.seed(123)

dat1 <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
dat2 <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
dat3 <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))

#This works but is inefficient

final_data<-data.frame(a=rowMeans(cbind(dat1$a,dat2$a,dat3$a)),
                       b=rowMeans(cbind(dat1$b,dat2$b,dat3$b)),
                       c=rowMeans(cbind(dat1$c,dat2$c,dat3$c)),
                       d=rowMeans(cbind(dat1$d,dat2$d,dat3$d)),
                       e=rowMeans(cbind(dat1$e,dat2$e,dat3$e)),
                       f=rowMeans(cbind(dat1$f,dat2$f,dat3$f))
)
#what results should look like
head(final_data)
#             a           b          c           d            e           f
# 1 0.573813625  0.17695841 -0.1434628 -0.53673101  0.353906578  0.24262067
# 2 0.135689926 -0.69206908  0.2888584 -0.37215810 -0.038298083 -0.23317107
# 3 0.004068807  0.44666945  0.5205118  0.09587453 -0.308528454  0.30516883
# 4 0.347100292  0.02401646  0.1409754 -0.15931120  0.587047386 -0.08684867
# 5 0.006529998  0.09010946  0.4932670  0.62606230 -0.005235813 -0.36967000
# 6 0.240225778 -0.45824825 -0.5000004  0.66131121  0.619480608  0.55650611

The problem is that I don't want to write out a=rowMeans(cbind(dat1$a,dat2$a,dat3$a)) for each of the 60 columns in the new data frame. Can you think of a good way to handle this?

Edit: I am going to accept the following answer, because it lets me choose which columns to apply it to:

final_data1<-as.data.frame(sapply(colnames(dat1),function(i)
    rowMeans(cbind(dat1[,i],dat2[,i],dat3[,i]))))

> identical(final_data1,final_data)
[1] TRUE

5 answers:

Answer 0 (score: 3)

I would use rbind to combine all the data sets into a single one, and then use data.table to compute the column means (for speed)

library(data.table)
df <- rbind(dat1, dat2, dat3)
indx <- seq_len(nrow(df)) %% nrow(dat1)  
setDT(df)[, lapply(.SD, mean), by = indx]

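The best thing about this approach is that once all the data sets are in a single data set, you can compute all sorts of functions (not just mean) without calling cbind each time. It is also easy to run the operation on specific columns using the .SDcols argument, for example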

cols <- names(df)[c(1,3:4)]
df[, lapply(.SD, mean), .SDcols = cols, by = indx]
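To make the "not just mean" point concrete, here is a minimal sketch, assuming the data.table package is installed. The three-column toy frames and the names `dfs`, `long`, `means` and `medians` are made up for illustration. It builds the row index with `rep()` instead of `%%`, so the last row of each block is labelled 16 rather than 0. With `by` that makes no difference (groups keep first-appearance order), but it avoids surprises if you later switch to `keyby`, which sorts the groups.

```r
library(data.table)

# Toy data: three data frames with identical columns (illustrative only)
set.seed(123)
dat1 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat2 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat3 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))

dfs  <- list(dat1, dat2, dat3)
long <- rbindlist(dfs)  # stack all rows into one data.table

# Row position within each source data frame: 1..16, repeated once per frame
long[, indx := rep(seq_len(nrow(dat1)), times = length(dfs))]

# Same grouping, different summary functions
means   <- long[, lapply(.SD, mean),   by = indx]
medians <- long[, lapply(.SD, median), by = indx]

# Restrict to chosen columns with .SDcols
means_ac <- long[, lapply(.SD, mean), by = indx, .SDcols = c("a", "c")]
```

Each result keeps an `indx` column next to the summarised columns. Drop it with `means[, indx := NULL]` if you only want the values.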

Answer 1 (score: 3)

How about this?

(dat1+dat2+dat3)/3

Or, to first select/reorder a subset of the columns and only then add the resulting data.frames, you could do:

jj <- letters[1:6]
Reduce(`+`, lapply(list(dat1,dat2,dat3), `[`, jj))/3
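A small sketch of why the select/reorder step can matter, using made-up three-column frames. As far as I know, arithmetic on data frames pairs columns by position rather than by name, so if one frame has its columns in a different order, indexing every frame with the same name vector first keeps the sums lined up. Dividing by `length()` instead of a hard-coded 3 also lets the same line work for five frames. The names `dat3_shuffled`, `aligned` and `cols` are only for illustration.

```r
# Toy data (illustrative only)
set.seed(123)
dat1 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat2 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat3 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))

# Pretend the third frame arrived with its columns in a different order
dat3_shuffled <- dat3[, c("c", "a", "b")]

dfs  <- list(dat1, dat2, dat3_shuffled)
cols <- names(dat1)  # the column order we want everywhere

# Put every frame into the same column order, then add and divide
aligned    <- lapply(dfs, `[`, cols)
final_data <- Reduce(`+`, aligned) / length(aligned)
```

Note that this relies on every selected column being numeric and on there being no missing values. Unlike `rowMeans(..., na.rm = TRUE)`, a single NA propagates into the result.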

Answer 2 (score: 2)

Try this:

sapply(colnames(dat1),function(i)
  rowMeans(cbind(dat1[,i],dat2[,i],dat3[,i])))
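If the frames are kept in a list, the same idea extends to any number of them without naming each one inside `cbind`. This is only a sketch with toy data. `dfs`, `cols` and `col_mean` are hypothetical names, and `cols` is where you would put whichever columns you actually want averaged. Since `sapply` returns a matrix here, it is wrapped in `as.data.frame` to get a data frame back.

```r
# Toy data (illustrative only)
set.seed(123)
dat1 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat2 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat3 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))

dfs  <- list(dat1, dat2, dat3)
cols <- c("a", "c")  # columns to average; use names(dat1) for all of them

# For one column name: pull that column from every frame (one column per
# frame in the resulting matrix), then average across frames row by row
col_mean <- function(col) rowMeans(sapply(dfs, `[[`, col))

final_data <- as.data.frame(sapply(cols, col_mean))
```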

Answer 3 (score: 1)

You could also try:

mapply(function(x,y,z) rowMeans(cbind(x,y,z)), dat1, dat2, dat3)
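The function above is written for exactly three frames. A possible generalisation, sketched with toy data, is to take `...` in the function and pass a list of frames through `do.call`. The names `dfs` and `row_mean_across` are made up for illustration. `mapply` walks the frames column by column, so it assumes they all have their columns in the same order, and the result is a matrix rather than a data frame.

```r
# Toy data (illustrative only)
set.seed(123)
dat1 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat2 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat3 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))

dfs <- list(dat1, dat2, dat3)

# Receives the k-th column of every frame and averages them row by row
row_mean_across <- function(...) rowMeans(cbind(...))

# Equivalent to mapply(row_mean_across, dat1, dat2, dat3)
res <- do.call(mapply, c(list(FUN = row_mean_across), dfs))

final_data <- as.data.frame(res)  # convert if a data frame is needed
```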

Answer 4 (score: 1)

Here is another attempt.

lst <- list(dat1, dat2, dat3)
bind <- do.call(cbind, lst)
sapply(colnames(dat1), function(x) {
  rowMeans(bind[, colnames(bind) == x])
})
                a           b            c           d           e            f
 [1,] -0.69651939 -0.43495675  0.267416865  0.48329853  0.61255811 -1.505583996
 [2,] -0.07074860  0.09862994 -0.003961269  0.73806156 -0.80865458 -1.367104216
 [3,] -0.90342272 -0.62873624  0.260394162 -0.28607083  1.10855838 -1.073984557
 [4,] -0.05890636  0.81463842 -0.227212609  0.21552260 -0.20440539 -0.071603144
 [5,]  0.34237648  0.11332086 -0.673674065 -0.17747223  0.21157555  0.641724519
 [6,] -0.15563697 -0.10291304  0.334530993 -0.42936296  0.16148849  0.635475661
 [7,]  0.05404325  1.36754458 -0.375816720  0.20686341  0.78680115  0.553046376
 [8,] -0.73117177  0.92057378  0.501956982  0.70190124  0.69835069  0.350644246
 [9,]  0.17803759  0.04951559 -1.098479453 -0.26502658 -0.61354619  1.027449014
[10,] -0.48196619  0.11175892 -0.179521990 -0.75229105  0.31444472  0.083272675
[11,] -0.32993871 -0.01253952 -0.585723144  0.70656176 -0.32358449 -0.252437496
[12,] -0.96078171  1.44073015  0.221025206  0.30641093 -0.89929299  0.005243541
[13,]  0.03855730 -0.07904409  0.579366082  0.87307855  0.08949804  0.023818143
[14,] -0.28243416  0.68603908 -0.046795603 -0.09192619  0.26275774  0.594420728
[15,] -0.83591175 -0.62040012  0.598931246 -0.22719000  0.50836421 -0.135153053
[16,] -0.55951822  0.42339116  0.162560131 -0.08010072  0.79547162 -0.334898253
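A related variation, offered only as a sketch: instead of binding the frames side by side and picking columns by name, stack them into a three-dimensional array (rows x columns x frames) and let `rowMeans` average over the last dimension through its `dims` argument. The toy data and the names `lst`, `arr` and `res` are for illustration. It assumes all frames are numeric and share the same shape and column order.

```r
# Toy data (illustrative only)
set.seed(123)
dat1 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat2 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))
dat3 <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16))

lst <- list(dat1, dat2, dat3)

# 16 x 3 x 3 array: arr[, , k] holds the values of the k-th data frame
arr <- array(
  unlist(lapply(lst, as.matrix)),
  dim      = c(nrow(dat1), ncol(dat1), length(lst)),
  dimnames = list(NULL, names(dat1), NULL)
)

# dims = 2 treats the first two dimensions as "rows" and averages the rest
res <- rowMeans(arr, dims = 2)
```

`res` is a 16 x 3 matrix with the original column names. `as.data.frame(res)` turns it back into a data frame.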