A fast way to compute weighted group-wise means in R?

Time: 2018-08-29 19:17:11

Tags: r optimization vectorization longitudinal

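Given longitudinal data, how can I compute a matrix in which each column holds the weighted group-wise mean of a given variable?

I have developed a method that requires a loop, and it is too slow. I suspect it can be vectorized, but the solution eludes me.

Here is how I am doing it currently: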

library(foreach)

# N is sample size
# g is the number of groups
# p is the number of variables
get_group_mean_matrix <- function(N, g, p){
  X <- matrix(rbinom(N*p, 10, .5), N)
  f <- sort((1:(N)) %% g + 1)
  w <- runif(N)
  dmmat <- foreach(i = unique(f), .combine = rbind) %do% {
    idx <- which(f == i)
    ws <- w[idx]/sum(w[idx])
    t((t(X[idx,]) %*% ws)) %x% rep(1, length(idx))
  }
  dmmat
}

> set.seed(666)
> get_group_mean_matrix(12, 3, 5)
          [,1]     [,2]     [,3]     [,4]     [,5]
 [1,] 5.261103 4.074266 5.828070 4.452703 5.990165
 [2,] 5.261103 4.074266 5.828070 4.452703 5.990165
 [3,] 5.261103 4.074266 5.828070 4.452703 5.990165
 [4,] 5.261103 4.074266 5.828070 4.452703 5.990165
 [5,] 5.560556 4.241942 3.698828 5.572523 4.212532
 [6,] 5.560556 4.241942 3.698828 5.572523 4.212532
 [7,] 5.560556 4.241942 3.698828 5.572523 4.212532
 [8,] 5.560556 4.241942 3.698828 5.572523 4.212532
 [9,] 4.289029 4.771115 5.150607 4.424339 6.346775
[10,] 4.289029 4.771115 5.150607 4.424339 6.346775
[11,] 4.289029 4.771115 5.150607 4.424339 6.346775
[12,] 4.289029 4.771115 5.150607 4.424339 6.346775
> library(microbenchmark)
> microbenchmark(get_group_mean_matrix(1200, 300, 50))
Unit: milliseconds
                                 expr      min       lq     mean   median       uq      max neval
 get_group_mean_matrix(1200, 300, 50) 76.33337 77.39607 80.76586 78.39808 84.46984 93.40047   100

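A version based on lfe::demeanlist (see the NB below) is a lot faster, but its output differs from the loop's.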

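I will accept either of two kinds of answer:

  1. An account of what lfe::demeanlist does in the weighted case. Shouldn't subtracting the weighted deviations from the original values give the weighted means? Knowing that, how do I compute the matrix of weighted group means?
  2. A method of computing the weighted mean matrix that does not use demeanlist.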
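
For the second kind of answer, one loop-free route is base R's rowsum(), which sums the rows of a matrix within groups. The sketch below is a minimal illustration under stated assumptions, not a benchmarked solution: group_wmean_matrix is a hypothetical helper name, and it takes X, f and w as arguments instead of generating them internally as the functions in this question do.

```r
# Sketch: weighted group means via rowsum(), expanded back to one row per observation.
# group_wmean_matrix is a hypothetical name, not part of any package.
group_wmean_matrix <- function(X, f, w) {
  f     <- as.character(f)            # rowsum() labels its result rows by group value
  num   <- rowsum(X * w, f)           # per-group sums of w_i * x_ij (g x p)
  den   <- rowsum(w, f)               # per-group sums of w_i        (g x 1)
  means <- num / as.vector(den)       # divide each group's row by its weight total
  out   <- means[f, , drop = FALSE]   # repeat each group's row for its members
  rownames(out) <- NULL
  out
}

# Same seed and same order of random draws as get_group_mean_matrix(12, 3, 5)
set.seed(666)
N <- 12; g <- 3; p <- 5
X <- matrix(rbinom(N * p, 10, .5), N)
f <- sort((1:N) %% g + 1)
w <- runif(N)
M <- group_wmean_matrix(X, f, w)
```

Because the inputs are drawn in the same order as in the loop version, M should be comparable to that function's output with all.equal(). Indexing the group means by name also means f does not have to be sorted.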

NB: Using lfe::demeanlist:

library(lfe)
get_group_mean_matrix_lfe <- function(N, g, p){
  X <- matrix(rbinom(N*p, 10, .5), N)
  f <- sort((1:(N)) %% g + 1)
  w <- runif(N)
  X - demeanlist(X, list(factor(f)), weights = w)
}

> set.seed(666)
> get_group_mean_matrix_lfe(12, 3, 5)
          [,1]     [,2]     [,3]     [,4]     [,5]
 [1,] 5.138068 4.001781 5.415467 4.722947 5.999827
 [2,] 5.138068 4.001781 5.415467 4.722947 5.999827
 [3,] 5.138068 4.001781 5.415467 4.722947 5.999827
 [4,] 5.138068 4.001781 5.415467 4.722947 5.999827
 [5,] 5.197308 4.067657 3.202478 5.866451 4.066385
 [6,] 5.197308 4.067657 3.202478 5.866451 4.066385
 [7,] 5.197308 4.067657 3.202478 5.866451 4.066385
 [8,] 5.197308 4.067657 3.202478 5.866451 4.066385
 [9,] 4.189951 4.887720 4.953305 4.501874 6.385846
[10,] 4.189951 4.887720 4.953305 4.501874 6.385846
[11,] 4.189951 4.887720 4.953305 4.501874 6.385846
[12,] 4.189951 4.887720 4.953305 4.501874 6.385846
> library(microbenchmark)
> microbenchmark(get_group_mean_matrix_lfe(1200, 300, 50))
Unit: milliseconds
                                     expr      min       lq     mean   median       uq      max neval
 get_group_mean_matrix_lfe(1200, 300, 50) 6.107421 6.202426 6.500411 6.293648 6.582943 8.350876   100

Replacing %*% with a matrix multiplication function from RcppEigen speeds things up, but not enough. I think the issue is the loop.

In the code above, X is the data matrix, w holds the weights, and f is the grouping factor.

1 answer:

Answer 0: (score: 1)

Hurr durr, all I had to do was give demeanlist the square root of the weights, hurr durr:

library(foreach)
get_group_mean_matrix <- function(N, g, p){
  X <- matrix(rbinom(N*p, 10, .5), N)
  f <- sort((1:(N)) %% g + 1)
  w <- runif(N)
  dmmat <- foreach(i = unique(f), .combine = rbind) %do% {
    idx <- which(f == i)
    ws <- w[idx]/sum(w[idx])
    t((t(X[idx,]) %*% ws)) %x% rep(1, length(idx))
  }
  dmmat
}

set.seed(666)
A <- get_group_mean_matrix(12, 3, 5)

library(lfe)
get_group_mean_matrix_lfe <- function(N, g, p){
  X <- matrix(rbinom(N*p, 10, .5), N)
  f <- sort((1:(N)) %% g + 1)
  w <- runif(N)
  X - demeanlist(X, list(factor(f)), weights = w^.5)
}

set.seed(666)
B <- get_group_mean_matrix_lfe(12, 3, 5)

> all.equal(A, B)
[1] TRUE
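
One possible reading of why the square root works, offered as an assumption and not as something confirmed from lfe's source or documentation: if the weights v passed to demeanlist enter the group mean squared, then passing sqrt(w) gives the ordinary w-weighted mean, while passing w itself gives a w^2-weighted mean. The sketch below only illustrates that arithmetic; squared_weight_mean is a hypothetical helper and does not call lfe.

```r
# Hypothetical model, NOT taken from lfe's source: the supplied weights v
# enter the mean squared.
squared_weight_mean <- function(x, v) sum(v^2 * x) / sum(v^2)

set.seed(1)
x <- rnorm(5)
w <- runif(5)

# Under this model, passing sqrt(w) recovers the usual weighted mean ...
same_as_weighted_mean <- isTRUE(all.equal(squared_weight_mean(x, sqrt(w)),
                                          weighted.mean(x, w)))

# ... while passing w itself weights each observation by w^2 instead.
with_raw_weights <- squared_weight_mean(x, w)
```

If the assumption holds, it would account for both results in this thread: different numbers with weights = w, and agreement with the loop with weights = w^.5.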