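I have a large matrix mat whose row names are group_label_x and whose column names are group_label_y. I want to aggregate mat into ave_mat by group_label_x and group_label_y, where ave_mat[i,j] is the mean of mat[ group_label_x[i], group_label_y[j] ], that is, the mean of all entries whose row belongs to the i-th row group and whose column belongs to the j-th column group. This can be done with a double for loop or by applying the aggregate function twice (aggregate( mat, by = list(group_label_x), FUN='mean' )). However, is there a faster way to do this? (I have many matrices to aggregate.)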
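To make the definition of ave_mat concrete, here is a minimal sketch of the double for loop on a tiny matrix. The names toy_mat, toy_label_x, toy_label_y and toy_ave are hypothetical and used only for illustration; they are not part of the original data.

```r
# Toy data (hypothetical, for illustration only): a 4 x 6 matrix with group labels
toy_mat <- matrix(1:24, nrow = 4)
toy_label_x <- c("a", "a", "b", "b")
toy_label_y <- c("p", "p", "p", "q", "q", "q")

groups_x <- unique(toy_label_x)
groups_y <- unique(toy_label_y)
toy_ave <- matrix(NA_real_, length(groups_x), length(groups_y),
                  dimnames = list(groups_x, groups_y))

# Reference (slow) definition: mean of each row-group x column-group block
for (i in seq_along(groups_x)) {
  for (j in seq_along(groups_y)) {
    block <- toy_mat[toy_label_x == groups_x[i],
                     toy_label_y == groups_y[j], drop = FALSE]
    toy_ave[i, j] <- mean(block)
  }
}
toy_ave
```

Any faster approach should reproduce this block-mean result; the loop is only useful as a correctness reference on small inputs.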
The following code generates a demo random matrix with about 1E4 rows and 2E4 columns, which I would like to aggregate into a ~1E3 x 1E3 matrix:
set.seed(1)
dim_x_raw = 1E4
dim_y_raw = 2E4
n_groups_x = 1E3
n_groups_y = 1E3
group_len_x = diff(sort(sample( 1:dim_x_raw, n_groups_x )))
group_label_x = rep( paste0('group_', 1:length(group_len_x)), group_len_x )
group_len_y = diff(sort(sample( 1:dim_y_raw, n_groups_y )))
group_label_y = rep( paste0('group_', 1:length(group_len_y)), group_len_y )
mat = matrix( runif( length(group_label_x)*length(group_label_y) ), length(group_label_x) )
######################################
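As a side note on the generating code, the sketch below repeats only the row-label part to show why the sizes are approximate: 1E3 sorted cut points leave 999 gaps, so there are 999 row groups, and the number of rows equals the distance between the smallest and largest cut point. The variable cut_points is a name introduced here for illustration.

```r
set.seed(1)
dim_x_raw <- 1E4
n_groups_x <- 1E3

# n_groups_x sorted cut points give n_groups_x - 1 gaps between them
cut_points <- sort(sample(1:dim_x_raw, n_groups_x))
group_len_x <- diff(cut_points)
group_label_x <- rep(paste0('group_', seq_along(group_len_x)), group_len_x)

length(group_len_x)    # number of row groups
length(group_label_x)  # number of rows the demo matrix will have
```

The same reasoning applies to the column labels, so the aggregated result has 999 x 999 group combinations.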
My aggregation code (very slow):
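ave_mat_x = aggregate( mat, by = list(group_label_x), FUN='mean' )
# drop the Group.1 label column before transposing, so only numeric columns are aggregated
# (the result has one row per column group and one column per row group)
ave_mat = aggregate( t(ave_mat_x[, -1]), by = list(group_label_y), FUN='mean' )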
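For comparison, the same two-step idea (average over row groups, then over column groups) can also be sketched with base R's rowsum, which computes group sums on a matrix directly. This is only a sketch under the assumption that every group is non-empty; it has not been benchmarked here. The helper group_mean_rows and the toy_ objects are hypothetical names.

```r
# Hypothetical helper: mean of the rows of m within each group of g
group_mean_rows <- function(m, g) {
  sums <- rowsum(m, g)              # one row of column sums per group
  counts <- table(g)                # number of rows in each group
  sums / as.vector(counts[rownames(sums)])
}

# Toy data (hypothetical, for illustration only)
toy_mat <- matrix(1:24, nrow = 4)
toy_label_x <- c("a", "a", "b", "b")
toy_label_y <- c("p", "p", "p", "q", "q", "q")

# Step 1: average over row groups; step 2: transpose and average over column groups
row_means <- group_mean_rows(toy_mat, toy_label_x)
toy_ave <- t(group_mean_rows(t(row_means), toy_label_y))
toy_ave
```

Averaging in two steps gives the block mean because, within a block, every column contributes the same number of rows.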
Answer 0: (score: 1)
You can try data.table. Of course, you can also run everything in a single line and check the speed.
library(data.table)
# add row and colnames
mat = matrix(runif( length(group_label_x)*length(group_label_y)), length(group_label_x),
dimnames = list(group_label_x, group_label_y))
# transform to data.table
mat_dt <- data.table(mat, keep.rownames = TRUE, stringsAsFactors = FALSE)
rm(mat) # remove the old matrix
# melt, summarise per group and calculate mean
mat_dt <- melt(mat_dt, id.vars = "rn")
head(mat_dt)
        rn variable     value
1: group_1  group_1 0.8718050
2: group_1  group_1 0.9671970
3: group_1  group_1 0.8669163
4: group_1  group_1 0.4377153
5: group_1  group_1 0.1919378
6: group_1  group_1 0.0822944
res <- mat_dt[,.(Mean=mean(value)),.(rn, variable)]
head(res)
        rn variable      Mean
1: group_1  group_1 0.4888935
2: group_2  group_1 0.3903115
3: group_3  group_1 0.4601481
4: group_4  group_1 0.5023852
5: group_5  group_1 0.5067483
6: group_6  group_1 0.4851856
dim(res)
[1] 998001 3
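The result above is in long format (one row per group pair), while the question asks for a matrix. A minimal sketch of the full path on toy data, including a reshape back to a matrix with dcast, might look like the following. It assumes the data.table package is installed, builds the long table directly from the labels instead of calling melt, and uses hypothetical toy_ names.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
toy_mat <- matrix(1:24, nrow = 4)
toy_label_x <- c("a", "a", "b", "b")
toy_label_y <- c("p", "p", "p", "q", "q", "q")

# Long format: as.vector() reads the matrix column by column, so row labels
# cycle fastest and each column label is repeated nrow(toy_mat) times
long <- data.table(
  rn       = rep(toy_label_x, times = ncol(toy_mat)),
  variable = rep(toy_label_y, each = nrow(toy_mat)),
  value    = as.vector(toy_mat)
)

# Mean per (row group, column group) pair, as in the answer
res <- long[, .(Mean = mean(value)), by = .(rn, variable)]

# Reshape to wide format and convert to a matrix with row groups as row names
wide <- dcast(res, rn ~ variable, value.var = "Mean")
toy_ave <- as.matrix(wide, rownames = "rn")
toy_ave
```

Wrapping the aggregation step in system.time() is one way to compare it against the aggregate-based version on your own data.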