Aggregate by indices and reweight in R

Time: 2016-02-13 16:37:03

Tags: r aggregate weighted-average

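I have a large set of price data indexed by state, date, and UPC (product code). I want to aggregate over UPC, combining the prices with a weighted average. I'll try to explain it, but you may just want to read the code below.

Each observation in the data set consists of: UPC, date, state, price, and weight. I want to aggregate out the UPC index in this way: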

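Take all data points with the same date and state, multiply their prices by their weights, and sum the results. This yields a weighted average, which I'll call priceIndex. However, for some date and state combinations the weights do not sum to 1. So I want to create two additional columns. The first is the sum of the weights for each date and state combination. The second is a reweighted average: i.e., if the two original weights were .5 and .3, change them to .5/(.5+.3) = .625 and .3/(.5+.3) = .375, and then recalculate the weighted average as another price index.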
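As a quick illustration of the renormalization just described, here is a minimal sketch in base R. Only the weights .5 and .3 come from the text above; the two prices are made up for the example, and the variable names are chosen here for illustration.

```r
w <- c(0.5, 0.3)            # original weights; they sum to 0.8, not 1
p <- c(10, 20)              # hypothetical prices, for illustration only

w_norm <- w / sum(w)        # rescaled weights: 0.625 and 0.375
reweighted <- sum(p * w_norm)

# weighted.mean() divides by sum(w), so it should give the same value
same <- isTRUE(all.equal(reweighted, weighted.mean(p, w)))
```

Dividing the weighted sum by the total weight gives the same number as rescaling each weight first, which is why a single division per group is enough.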

This is what I mean:

upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200603,200601,200602,200601,200602,200603,200601,200602,200601,200602,200603,200601)
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
state=c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2)
weight=c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45)

# This is what I have:
data <- data.frame(upc,date,state,price,weight)
data

# These are a few of the weighted calculations:
# .3*26+85*.5+24*.2 = 55.1
# 28*.2+84*.5+19*.15 = 50.45
# 27*.6+98*.3 = 45.6
# Etc. etc.

# Here is the reweighted calculation for date==200602 & state==1:
# 28*(.2/.85)+84*(.5/.85)+19*(.15/.85) = 59.35294
# Or, equivalently:
# (28*.2+84*.5+19*.15)/.85 = 59.35294

# This is what I want:
date=c(200601,200602,200603,200601,200602,200603)
state=c(1,1,1,2,2,2)
priceIndex=c(55.1,50.45,45.6,42.5,51,46.575)
totalWeight=c(1,.85,.9,1,1,.85)
reweightedIndex=c(55.1,59.35294,50.66667,42.5,51,54.79412)
index <- data.frame(date,state,priceIndex,totalWeight,reweightedIndex)
index

Also, not that it matters, but the data set has about 35 states, 150 UPCs, and 84 dates, so there are a lot of observations.

Thanks very much in advance.

1 answer:

Answer 0 (score: 2)

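We can use one of the group-by summarise operations. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'date' and 'state', compute the sum of the product of 'price' and 'weight' and sum(weight) as temporary variables, and then create the 3 variables in a list based on those temporaries.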

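library(data.table)
# setDT() converts 'data' to a data.table by reference
setDT(data)[, {tmp1 = sum(price*weight)  # weighted sum of prices per group
               tmp2 = sum(weight)        # total weight per group
               list(priceIndex = tmp1, totalWeight = tmp2,
                    reweightedIndex = tmp1/tmp2)}, .(date, state)]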
#    date state priceIndex totalWeight reweightedIndex
#1: 200601     1     55.100        1.00        55.10000
#2: 200602     1     50.450        0.85        59.35294
#3: 200603     1     45.600        0.90        50.66667
#4: 200603     2     46.575        0.85        54.79412
#5: 200601     2     42.500        1.00        42.50000
#6: 200602     2     51.000        1.00        51.00000

Or, using dplyr, we can use summarise to create the 3 columns after grouping by 'date' and 'state'.

library(dplyr)
data %>% 
  group_by(date, state) %>% 
  summarise(priceIndex = sum(price*weight),
            totalWeight = sum(weight),
            reweightedIndex = priceIndex/totalWeight)
#   date state priceIndex totalWeight reweightedIndex
#   (dbl) (dbl)      (dbl)       (dbl)           (dbl)
#1 200601     1     55.100        1.00        55.10000
#2 200601     2     42.500        1.00        42.50000
#3 200602     1     50.450        0.85        59.35294
#4 200602     2     51.000        1.00        51.00000
#5 200603     1     45.600        0.90        50.66667
#6 200603     2     46.575        0.85        54.79412
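If neither package is at hand, the same grouping can probably be sketched in base R as well. This is not part of the original answer, only one possible approach: the names `groups` and `index` are chosen here for illustration, and the data frame is rebuilt from the question so that the snippet runs on its own.

```r
# Rebuild the example data from the question
upc    <- c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,
            2105900750,2105900750,2105900750,2105900750,2105900750,
            2173300001,2173300001,2173300001,2173300001)
date   <- c(200601,200602,200603,200603,200601,200602,200601,200602,200603,
            200601,200602,200601,200602,200603,200601)
price  <- c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
state  <- c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2)
weight <- c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45)
data   <- data.frame(upc, date, state, price, weight)

# One sub-data-frame per (date, state) combination; drop = TRUE skips
# combinations that never occur in the data
groups <- split(data, list(data$date, data$state), drop = TRUE)

# Summarise each group into a one-row data frame, then stack the rows
index <- do.call(rbind, lapply(groups, function(g) {
  data.frame(date            = g$date[1],
             state           = g$state[1],
             priceIndex      = sum(g$price * g$weight),
             totalWeight     = sum(g$weight),
             # weighted.mean() divides by the total weight of the group
             reweightedIndex = weighted.mean(g$price, g$weight))
}))
rownames(index) <- NULL
index
```

With roughly 35 states, 150 UPCs, and 84 dates this split-and-bind approach should still be workable, though data.table or dplyr is likely to be the faster choice on much larger data.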