Adding a Group By to a For-Loop

Time: 2018-05-07 15:33:35

Tags: r for-loop dplyr

I have a dataset like this:

# Define Adstock Rate
adstock_rate = 0.50

# Create Data
advertising = c(117.913, 120.112, 125.828, 115.354, 177.090, 141.647, 137.892,   0.000,   0.000,   0.000,   0.000, 
            0.000,   0.000,   0.000,   0.000,   0.000,   0.000, 158.511, 109.385,  91.084,  79.253, 102.706, 
            78.494, 135.114, 114.549,  87.337, 107.829, 125.020,  82.956,  60.813,  83.149,   0.000,   0.000, 
            0.000,   0.000,   0.000,   0.000, 129.515, 105.486, 111.494, 107.099,   0.000,   0.000,   0.000, 
            0.000,   0.000,   0.000,   0.000,   0.000,   0.000,   0.000,   0.000,
            134.913, 123.112, 178.828, 112.354, 100.090, 167.647, 177.892,   0.000,   0.000,   0.000,   0.000, 
            0.000,   0.000,   0.000,   0.000,   0.000,   0.000, 112.511, 155.385,  123.084,  89.253, 67.706, 
            23.494, 122.114, 112.549,  65.337, 134.829, 123.020,  81.956,  23.813,  65.149,   0.000,   0.000, 
            0.000,   0.000,   0.000,   0.000, 145.515, 154.486, 121.494, 117.099,   0.000,   0.000,   0.000, 
            0.000,   0.000,   0.000,   0.000,   0.000,   0.000,   0.000,   0.000
            )

Region = c(500, 500, 500, 500, 500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500,
       500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500,500, 500, 500, 500, 500, 500, 
       500, 500,
       501, 501, 501, 501, 501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501,
       501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501,501, 501, 501, 501, 501, 501, 
       501, 501)

advertising_dataset<-data.frame(cbind(Region, advertising))

This is what the dataset looks like:

   Region advertising
1     500     117.913
2     500     120.112
3     500     125.828
4     500     115.354
5     500     177.090
6     500     141.647
7     500     137.892
8     500       0.000
9     500       0.000
10    500       0.000
11    500       0.000
12    500       0.000
13    500       0.000
14    500       0.000
15    500       0.000
16    500       0.000
17    500       0.000
18    500     158.511
19    500     109.385
20    500      91.084

From here, I apply a lag-style transformation: I keep the first value as is, then use a for loop to transform the rest of the dataset.

# Alternative Method Using Loops Proposed by Linh Tran
advertising_dataset$adstocked_advertising = numeric(length(advertising_dataset$advertising))
advertising_dataset$adstocked_advertising[1] = advertising_dataset$advertising[1]

for(i in 2:length(advertising_dataset$advertising)){
  advertising_dataset$adstocked_advertising[i] = advertising_dataset$advertising[i] + adstock_rate * advertising_dataset$adstocked_advertising[i-1]}
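The loop above implements the recurrence adstocked[i] = advertising[i] + adstock_rate * adstocked[i-1], with the first value carried over unchanged. As a minimal sketch, the same recurrence can be wrapped in a function that works on any numeric vector, which makes it easier to reuse per region. The helper name `adstock` is made up for illustration and does not appear in the original post.

```r
# Hypothetical helper (the name is illustrative): geometric adstock of a numeric vector.
adstock <- function(x, rate) {
  # Reduce() passes the previously accumulated value as `prev`;
  # with accumulate = TRUE it returns every intermediate result.
  Reduce(function(prev, current) current + rate * prev, x, accumulate = TRUE)
}

# First four advertising values for Region 500 from the data above
x <- c(117.913, 120.112, 125.828, 115.354)
result <- adstock(x, rate = 0.50)
print(result)
```

The first element is returned unchanged, which matches the explicit assignment of `adstocked_advertising[1]` in the loop version.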

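The problem is that my dataset is split by region. I need to apply the transformation above separately for each region (including taking the first value of each region).

Is there a way to do this with the dplyr package?

I know this is wrong, but perhaps something like: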

library(dplyr)
separated_by_region<- advertising_dataset %>%
group_by(Region) %>%
summarise(
advertising_dataset$adstocked_advertising = 
numeric(length(advertising_dataset$advertising))
advertising_dataset$adstocked_advertising[1] = 
advertising_dataset$advertising[1]

for(i in 2:length(advertising_dataset$advertising)){
  advertising_dataset$adstocked_advertising[i] = 
advertising_dataset$advertising[i] + adstock_rate * 
advertising_dataset$adstocked_advertising[i-1]})

Something along those lines. I'm not sure how to do it.

I have a feeling I may have to use split(advertising_dataset, advertising_dataset$Region), apply a function to each piece, and rbind the results.
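The split/apply/rbind idea can be sketched in base R as follows. This is only an illustration on made-up data: the function name `adstock_one_region` and the small data frame are assumptions, not part of the original post.

```r
# Small illustrative data frame (made-up values, two regions)
df <- data.frame(
  Region = c(500, 500, 500, 501, 501),
  advertising = c(100, 50, 0, 200, 0)
)
adstock_rate <- 0.50

# Hypothetical helper: run the adstock loop on the rows of a single region
adstock_one_region <- function(d) {
  d$adstocked_advertising <- d$advertising
  # seq_len(n)[-1] is 2..n and is empty when the region has only one row
  for (i in seq_len(nrow(d))[-1]) {
    d$adstocked_advertising[i] <- d$advertising[i] +
      adstock_rate * d$adstocked_advertising[i - 1]
  }
  d
}

# Split by region, transform each piece, then stack the pieces back together
pieces <- split(df, df$Region)
by_region <- do.call(rbind, lapply(pieces, adstock_one_region))
rownames(by_region) <- NULL
print(by_region)
```

Because each piece is processed on its own, the carry-over from the last row of one region never leaks into the first row of the next.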

Any help would be great, thanks!

Example output (though the function needs to be applied per region). Final dataset:

  Region     advertising     adstocked_advertising
     500         117.913               117.9130000
     500         120.112               179.0685000
     500         125.828               215.3622500
     500         115.354               223.0351250
     500         177.090               288.6075625
     500         141.647               285.9507812
     500         137.892               280.8673906
     500           0.000               140.4336953
     500           0.000                70.2168477
     500           0.000                35.1084238
     500           0.000                17.5542119
     500           0.000                 8.7771060
     500           0.000                 4.3885530
     500           0.000                 2.1942765
     500           0.000                 1.0971382
     500           0.000                 0.5485691
     500           0.000                 0.2742846
     500         158.511               158.6481423
     500         109.385               188.7090711
     500          91.084               185.4385356

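1 Answer:

Answer 0: (score: 1)

I don't think this is quite what you meant by using dplyr, nor that it is any better than the do.call(rbind, lapply(...)) approach, but you can define a function that does what you described above: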

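foo <- function(df_) {
  # Uses adstock_rate from the calling environment (defined in the question)
  df_$adstocked_advertising = df_$advertising
  # seq_len(n)[-1] is 2..n and is empty for a one-row group,
  # whereas 2:nrow(df_) would count down from 2 to 1
  for (i in seq_len(nrow(df_))[-1]) {
    df_$adstocked_advertising[i] = df_$advertising[i] + adstock_rate * df_$adstocked_advertising[i - 1]
  }
  return(df_)
}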

Then, in your pipe, group_by Region and apply the function to each group:

library(dplyr)

adv_2 <- data.frame(advertising_dataset %>%
  group_by(Region) %>%
  do(foo(data.frame(.))))


> adv_2[1:10,]
   Region advertising adstocked_advertising
1     500     117.913             117.91300
2     500     120.112             179.06850
3     500     125.828             215.36225
4     500     115.354             223.03512
5     500     177.090             288.60756
6     500     141.647             285.95078
7     500     137.892             280.86739
8     500       0.000             140.43370
9     500       0.000              70.21685
10    500       0.000              35.10842

> adv_2[50:60,]
   Region advertising adstocked_advertising
50    500       0.000              0.401496
51    500       0.000              0.200748
52    500       0.000              0.100374
53    501     134.913            134.913000
54    501     123.112            190.568500
55    501     178.828            274.112250
56    501     112.354            249.410125
57    501     100.090            224.795063
58    501     167.647            280.044531
59    501     177.892            317.914266
60    501       0.000            158.957133

It definitely needs a numerical check, but it does appear to match your output for the 500 group.
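One way to run such a check is to recompute a few rows independently and compare them with the printed values. The sketch below uses group_by() with mutate() instead of do(), and a hypothetical helper `adstock_vec` that is not part of the answer; it covers only the first three rows of each region from the question's data.

```r
library(dplyr)

adstock_rate <- 0.50

# First three rows of each region, copied from the question's data
check_df <- data.frame(
  Region = c(500, 500, 500, 501, 501, 501),
  advertising = c(117.913, 120.112, 125.828, 134.913, 123.112, 178.828)
)

# Hypothetical helper: the same recurrence as foo(), applied to a plain vector
adstock_vec <- function(x, rate) {
  Reduce(function(prev, current) current + rate * prev, x, accumulate = TRUE)
}

# mutate() on a grouped data frame evaluates the helper once per region
checked <- check_df %>%
  group_by(Region) %>%
  mutate(adstocked_advertising = adstock_vec(advertising, adstock_rate)) %>%
  ungroup()

print(as.data.frame(checked))
```

The values for Region 501 restart from 134.913 rather than continuing from Region 500, which is the behaviour the question asked for.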

Edit

Following the comments, here is a version with an adjustable lag value.

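foo <- function(df_, lag_val = 1) {
  # The first lag_val rows of each group keep their raw advertising value
  df_$adstocked_advertising = df_$advertising
  # Guard: a group with lag_val rows or fewer has nothing to update
  if (nrow(df_) > lag_val) {
    for (i in (1 + lag_val):nrow(df_)) {
      df_$adstocked_advertising[i] = df_$advertising[i] + adstock_rate * df_$adstocked_advertising[i - lag_val]
    }
  }
  return(df_)
}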

The default lag is still 1, but you can now change lag_val if you want each row to look back that many rows in the 'adstocked' column.

adv_2 <- data.frame(advertising_dataset %>%
  group_by(Region) %>%
  do(foo(data.frame(.), lag_val = 3)))

> adv_2
    Region advertising adstocked_advertising
1      500     117.913            117.913000
2      500     120.112            120.112000
3      500     125.828            125.828000
4      500     115.354            174.310500
5      500     177.090            237.146000
6      500     141.647            204.561000
7      500     137.892            225.047250
8      500       0.000            118.573000
9      500       0.000            102.280500
10     500       0.000            112.523625
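The lagged output can be cross-checked on a single group with base R's stats::filter() in recursive mode, where only the coefficient at position lag_val is non-zero. This is a sketch for one region's values only; applying it per region would still need a grouping step, and the variable names `coefs` and `lagged` are illustrative.

```r
adstock_rate <- 0.50
lag_val <- 3

# First five advertising values for Region 500 from the question
x <- c(117.913, 120.112, 125.828, 115.354, 177.090)

# Recursive filter: y[i] = x[i] + coefs[1] * y[i-1] + ... + coefs[p] * y[i-p],
# with values before the start of the series treated as zero by default.
coefs <- c(rep(0, lag_val - 1), adstock_rate)
lagged <- as.numeric(stats::filter(x, filter = coefs, method = "recursive"))
print(lagged)
```

For example, row 4 is 115.354 + 0.50 * 117.913 = 174.3105, which agrees with the fourth row of the output above.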

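I think this does what you want, but it is definitely worth verifying. Hopefully it helps with your other related problems too, though I'd guess it will need some modification to be more flexible.

Cheers,

Luke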