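R: data.table weighted means of multiple variables, each with multiple weight variables

Date: 2016-10-26 21:28:36

Tags: r list data.table weighted-average

I am still new to data.table. My question is similar to this one and this one. The difference is that I want to compute weighted means of multiple variables by group, using more than one weight for each mean.

Consider the following data.table (the real one is much larger):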

library(data.table)

set.seed(123456)

mydata <- data.table(CLID = rep("CNK", 10),
                     ITNUM = rep(c("First", "Second", "First", "First", "Second"), 2),
                     SATS = rep(c("Always", "Amost always", "Sometimes", "Never", "Always"), 2),
                     ASSETS = rep(c("0-10", "11-25", "26-100", "101-200", "MORE THAN 200"), 2),
                     AVGVALUE1 = rnorm(10, 10, 2),
                     AVGVALUE2 = rnorm(10, 10, 2),
                     WGT1 = rnorm(10, 3, 1),
                     WGT2 = rnorm(10, 3, 1),
                     WGT3 = rnorm(10, 3, 1))

#I set the key of the table to the variables I want to group by,
#so the output is sorted
setkeyv(mydata, c("CLID", "ITNUM", "SATS", "ASSETS"))

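What I want to achieve is to compute the weighted means of AVGVALUE1 and AVGVALUE2 (and possibly more variables) by the groups defined by ITNUM, SATS and ASSETS, using each of the weight variables WGT1, WGT2 and WGT3 (and possibly more). So for every variable I want a weighted mean of, I will get three weighted means (or as many as there are weights).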

I can do this for each variable separately, for example:

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.var <- "AVGVALUE1"
split.vars <- c("ITNUM", "SATS", "ASSETS")

mydata[ , Map(f = weighted.mean,
              x = .(get(avg.var)),
              w = mget(all.weights),
              na.rm = TRUE),
        by = c(key(mydata)[1], split.vars)]

I added the first key variable to by, although it is a constant, because I want it as a column in the output. I get:

   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

However, in the real data.table I have many more columns to compute weighted means for (and many more weights to use), so doing it one by one would be quite cumbersome. What I have in mind is a function where the mean of each variable (AVGVALUE1, AVGVALUE2, and so on) is computed with each of the weight variables (WGT1, WGT2, WGT3, and so on), and the output for each variable is added to a list. I guess a list is the best option, because if all the estimates were in the same output the number of columns could be endless. So something like this:

[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

What I have tried so far:

  1. Using lapply:

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars <- c("AVGVALUE1", "AVGVALUE2")
    split.vars <- c("ITNUM", "SATS", "ASSETS")

    lapply(mydata, function(i) {
      mydata[ , Map(f = weighted.mean,
                    x = mget(avg.vars)[i],
                    w = mget(all.weights),
                    na.rm = TRUE),
              by = c(key(mydata)[1], split.vars)]
    })

    Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]], : 'x' and 'w' must have the same length

  2. Using mapply:

    myfun <- function(data, spl.v, avg.v, wgts) {
      data[ , Map(f = weighted.mean,
                  x = mget(avg.v),
                  w = mget(all.weights),
                  na.rm = TRUE),
            by = c(key(data)[1], spl.v)]
    }

    mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars, wgts = all.weights)

    Error: value for ‘AVGVALUE2’ not found

  3. I tried wrapping mget(avg.v) in a list, i.e. .(mget(avg.v)), but then a different error came up.

Can anyone help?

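2 answers:

Answer 0 (score: 2)

We can use outer (which applies a function to every combination of values from two input vectors) on a vectorized weighted-mean function. Because the function passed to outer is defined inside the data.table's scope, it can use get to evaluate the data.table columns: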

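# Assumption: avg.var holds both value columns here, which is what
# the output further down shows
avg.var = c("AVGVALUE1", "AVGVALUE2")

wmeans = mydata[, {
  # weighted mean of one value column X using one weight column Y
  f  = function(X,Y) weighted.mean(get(X), get(Y));
  # outer() needs a function that accepts vectors of names
  vf = Vectorize(f);
  outer(avg.var, all.weights, vf)},
  by = split.vars]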

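This puts all the means into a single column (i.e. "long" format). We can also add a couple of columns that say which value/weight combination each row refers to: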

wmeans[, mean.v := expand.grid(avg.var, all.weights)[,1]]       
wmeans[, mean.w := expand.grid(avg.var, all.weights)[,2]]
head(wmeans)
#    ITNUM   SATS ASSETS        V1    mean.v mean.w
# 1: First Always   0-10 11.668243 AVGVALUE1   WGT1
# 2: First Always   0-10  9.132899 AVGVALUE2   WGT1
# 3: First Always   0-10 11.668192 AVGVALUE1   WGT2
# 4: First Always   0-10  9.060045 AVGVALUE2   WGT2
# 5: First Always   0-10 11.668287 AVGVALUE1   WGT3
# 6: First Always   0-10  9.197005 AVGVALUE2   WGT3

We can use dcast to reshape this into a data.table that is long in avg.var but wide in all.weights:

wide.wmeans = dcast(wmeans, mean.v+ITNUM+SATS+ASSETS ~ mean.w, value.var = "V1")  
#       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
# 1: AVGVALUE1  First       Always          0-10 11.668243 11.668192 11.668287
# 2: AVGVALUE1  First        Never       101-200 11.373780 12.210083 11.601819
# 3: AVGVALUE1  First    Sometimes        26-100 12.430039 13.134499 12.013299
# 4: AVGVALUE1 Second       Always MORE THAN 200 12.322651 11.816135 12.567860
# 5: AVGVALUE1 Second Amost always         11-25 10.765557 11.346688 10.524583
# 6: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
# 7: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
# 8: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
# 9: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
#10: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928

If you need this as a list rather than a single data.table, you can split it using:

lapply(avg.var, function(x) wide.wmeans[mean.v == x])
# [[1]]
#       mean.v  ITNUM         SATS        ASSETS     WGT1     WGT2     WGT3
# 1: AVGVALUE1  First       Always          0-10 11.66824 11.66819 11.66829
# 2: AVGVALUE1  First        Never       101-200 11.37378 12.21008 11.60182
# 3: AVGVALUE1  First    Sometimes        26-100 12.43004 13.13450 12.01330
# 4: AVGVALUE1 Second       Always MORE THAN 200 12.32265 11.81613 12.56786
# 5: AVGVALUE1 Second Amost always         11-25 10.76556 11.34669 10.52458
# 
# [[2]]
#       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
# 1: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
# 2: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
# 3: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
# 4: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
# 5: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928
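
To see what the outer/Vectorize combination does on its own, outside data.table, here is a minimal base R sketch. The data frame toy and the helper wm are made up for illustration and are not part of the answer above. The point is only that outer calls its function once with whole vectors of names, so a function written for a single pair of names has to be wrapped in Vectorize first.

```r
# Hypothetical toy data, not taken from the question
toy <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30),
                  w1 = c(1, 1, 1), w2 = c(0, 0, 1))

# Scalar helper: one value-column name and one weight-column name
wm <- function(v, w) weighted.mean(toy[[v]], toy[[w]])

# outer() passes full vectors of names to its function,
# so the scalar helper is vectorized before being handed over
res <- outer(c("x1", "x2"), c("w1", "w2"), Vectorize(wm))
dimnames(res) <- list(c("x1", "x2"), c("w1", "w2"))
res
```

Each row of res corresponds to one value column and each column to one weight, which is the same layout the dcast step produces per group.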

Answer 1 (score: 0)

I. lapply solution

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
split.vars  <- c("ITNUM", "SATS", "ASSETS")

myfun <- function(avg.vars){
  tmp <-
    mydata[ , Map(f = weighted.mean, 
                x = .(get(avg.vars)), 
                w = mget(all.weights),
                na.rm = TRUE), 
          by = c(key(mydata)[1], split.vars)]  

  return(tmp) # totally optional, a habit from using C and Java
}

lapply(avg.vars, myfun)

Upsides:

  • Uses *apply
  • Avoids a loop
  • Faster than doing the variables one by one

Downsides:

  • Returns a list:
[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

II. for loop solution

An example using a simple for loop, where avg.vars has 2 values:

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
split.vars  <- c("ITNUM", "SATS", "ASSETS")

result <- data.frame(matrix(nrow=0,ncol=7))
for(i in avg.vars){
  tmp <- 
    mydata[ , Map(f = weighted.mean, 
                x = .(get(i)), 
                w = mget(all.weights),
                na.rm = TRUE), 
          by = c(key(mydata)[1], split.vars)]  

  result <- rbind(result,tmp,use.names=F)
}
colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")
result
    CLID  ITNUM         SATS        ASSETS        V1        V2        V3
 1:  CNK  First       Always          0-10 11.668243 11.668192 11.668287
 2:  CNK  First        Never       101-200 11.373780 12.210083 11.601819
 3:  CNK  First    Sometimes        26-100 12.430039 13.134499 12.013299
 4:  CNK Second       Always MORE THAN 200 12.322651 11.816135 12.567860
 5:  CNK Second Amost always         11-25 10.765557 11.346688 10.524583
 6:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
 7:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
 8:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
 9:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
10:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

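Upsides:

  • Finishes instantly on the example
  • Scales to any number of columns with no additional data manipulation or coding
  • Saves a lot of time compared with doing the variables one by one
  • Returns a nice data.table
  • If you really do need a list, you can get one by initializing result as a list (result <- list()), creating a counter variable (n <- 1), replacing the rbind statement with result[[n]] <- tmp, and incrementing the counter inside the loop (n <- n + 1)

Downsides:

  • If your data is very large (e.g. > 100,000 rows and dozens or more values in avg.vars), any loop, or any function written around a loop, will perform poorly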
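
The list variant mentioned among the upsides could look roughly like the sketch below. It runs on a small made-up table (dt, with columns G, X1, X2, W1, W2, all hypothetical) rather than on mydata, and it indexes the list by variable name instead of using a counter n, which is a simplification of mine rather than what the answer describes. It also writes list(get(v)) where the answer writes .(get(i)).

```r
library(data.table)

# Small made-up table; the names below are illustrative only
dt <- data.table(G  = c("a", "a", "b", "b"),
                 X1 = c(1, 3, 5, 7),
                 X2 = c(2, 4, 6, 8),
                 W1 = c(1, 1, 1, 1),
                 W2 = c(1, 3, 1, 3))

avg.vars    <- c("X1", "X2")
all.weights <- c("W1", "W2")

result <- list()   # one data.table per averaged variable
for (v in avg.vars) {
  # [[ ]] stores the whole data.table as a single list element;
  # single brackets would try to spread its columns over the list
  result[[v]] <- dt[, Map(f = weighted.mean,
                          x = list(get(v)),
                          w = mget(all.weights),
                          na.rm = TRUE),
                    by = G]
}
result
```

Naming the elements after the variables makes it easy to pull one table out later, for example result[["X1"]].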