R: bootstrap weighted mean by group with data.table

Asked: 2018-02-20 13:23:57

Tags: r data.table statistics-bootstrap

I am trying to combine two approaches:

  1. Bootstrapping multiple columns in data.table in a scalable fashion
  2. Bootstrap weighted mean in R

Here is some random data:

      ## Generate sample data
      
      # Function to randomly generate weights
      set.seed(7)
      rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
      qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
      }
      
      # Generate variables
      nps    <- round(runif(3500, min=-1, max=1), 0) # nps value which takes 1, 0 or -1
      group  <- sample(letters[1:11], 3500, TRUE) # groups
      weight <- rtnorm(n=3500, mean=1, sd=1, a=0.04, b=16) # weights between 0.04 and 16
      
      # Build data frame
      df = data.frame(group, nps, weight)
      
      # The following packages / libraries are required:
      require("data.table")
      require("boot")
      

Here is the code for the weighted mean, taken from the "Bootstrap weighted mean in R" post mentioned above:

      samplewmean <- function(d, i, j) {
        d <- d[i, ]
        w <- j[i, ]
        return(weighted.mean(d, w))   
      }
      
      results_qsec <- boot(data= df[, 2, drop = FALSE], 
                           statistic = samplewmean, 
                           R=10000, 
                           j = df[, 3 , drop = FALSE])
      

This works fine.

Below is the code from the other post, which bootstraps the mean by group in a data.table:

      dt = data.table(df)
      stat <- function(x, i) {x[i, (m=mean(nps))]}
      dt[, list(list(boot(.SD, stat, R = 100))), by = group]$V1
      

This also works.

What I cannot do is combine the two approaches.

Running ...

      dt[, list(list(boot(.SD, samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
      

... gives this error message:

      Error in weighted.mean.default(d, w) : 
        'x' and 'w' must have the same length
      

Running ...

      dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
      

... raises a different error:

      Error in weighted.mean.default(d, w) : 
        (list) object cannot be coerced to type 'double'
      

I still do not fully understand the arguments in data.table, or how to combine functions that operate on a data.table.

Any help would be appreciated.

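1 answer:

Answer 0 (score: 2)

This comes down to how data.table behaves inside the function. After subsetting with i, d inside samplewmean is still a data.table (a single-column data.frame would have been dropped to a plain vector by d[i, ]), whereas weighted.mean expects numeric vectors for both the values and the weights. Calling unlist before weighted.mean fixes this error: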


Error in weighted.mean.default(d, w) : (list) object cannot be coerced to type 'double'
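The difference in subsetting behavior can be checked directly. The following is a minimal sketch on a toy column; the names df1, dt1, and i are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Toy single-column inputs (hypothetical names, for illustration only)
df1 <- data.frame(nps = c(1, 0, -1))
dt1 <- as.data.table(df1)
i   <- c(1, 1, 3)  # a resampled index vector such as boot() would pass

# A one-column data.frame is dropped to a plain vector by [i, ]
class(df1[i, ])

# A data.table keeps its class, so weighted.mean() receives a list
class(dt1[i, ])

# unlist() turns the one-column data.table back into a numeric vector
is.numeric(unlist(dt1[i, ]))
```

This is why the same samplewmean runs on the data.frame but fails once the input is a data.table.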

Code that unlists before passing to weighted.mean:

samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  return(weighted.mean(unlist(d), unlist(w)))   
}

dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
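One caveat about the call above: inside `j`, `dt[, 2]` and `dt[, 3]` refer to the whole table rather than to the current group, so every group gets a bootstrap of the full sample. (The asker's first attempt failed for a related reason: `.SD` carried two columns while `j` carried one, hence the length mismatch.) A per-group variant might look like the sketch below. It assumes a small toy table instead of the question's data, and the names wmean_stat, res, and fit are hypothetical.

```r
library(data.table)
library(boot)

set.seed(7)
# Toy data with the same column names as in the question (values are made up)
dt <- data.table(
  group  = sample(letters[1:3], 300, TRUE),
  nps    = sample(c(-1, 0, 1), 300, TRUE),
  weight = runif(300, 0.04, 16)
)

# Statistic: weighted mean over the rows selected by the bootstrap indices i
wmean_stat <- function(d, i) {
  weighted.mean(d$nps[i], d$weight[i])
}

# .SD contains only the rows of the current group, so each boot() call
# resamples within that group. copy() is a precaution so that the data kept
# inside each boot object is not tied to data.table's per-group buffer.
res <- dt[, list(fit = list(boot(copy(.SD), wmean_stat, R = 200))), by = group]

# Observed weighted mean of the first group
res$fit[[1]]$t0
```

Using `$` on the columns sidesteps the data.table versus vector issue entirely, because `d$nps[i]` is already a numeric vector.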

A more data.table-like syntax (data.table version >= v1.10.2) might be:

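# boot() passes the vector of resampled row indices as the second positional
# argument. valCol and wgtCol are matched by name, so the indices arrive in idx.
# They must be used to subset d, otherwise every replicate equals the original.
samplewmean <- function(d, valCol, wgtCol, idx) {
    weighted.mean(unlist(d[idx, ..valCol]), unlist(d[idx, ..wgtCol]))
}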

dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol="nps", wgtCol="weight"))), by=group]$V1
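The trailing argument in the statistic's signature is filled by boot() itself: with the default stype = "i" it calls the statistic with the data first and an index vector second, and any named extras are forwarded through `...`. The sketch below records what arrives in that slot. The names x, probe, and trace_env are hypothetical and used only for this demonstration.

```r
library(boot)

set.seed(1)
x <- data.frame(v = 1:5)

# Environment used to record the index vectors the statistic receives
trace_env <- new.env()
trace_env$calls <- list()

# Same argument layout as the answer: a named extra first, indices last
probe <- function(d, valCol, idx) {
  trace_env$calls[[length(trace_env$calls) + 1]] <- idx
  mean(d[[valCol]][idx])
}

b <- boot(x, probe, R = 3, valCol = "v")

# The first recorded call uses the original row order and produces t0
trace_env$calls[[1]]

# Later calls carry resampled indices of the same length
trace_env$calls[-1]
```

A statistic that ignores this index vector returns the same value for every replicate, so the indices need to be applied to the data before computing the weighted mean.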

Another possible syntax is (see data.table FAQ 1.6):

# idx receives the resampled row indices from boot(); subset d with them first
samplewmean <- function(d, valCol, wgtCol, idx) {
    d <- d[idx]
    weighted.mean(unlist(d[, eval(substitute(valCol))]), unlist(d[, eval(substitute(wgtCol))]))
}

dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol=nps, wgtCol=weight))), by=group]$V1