How do I perform row-wise division in R with a group_by statement?

Asked: 2015-05-07 09:04:00

Tags: r dataframe dplyr

I have the following data frame:

    Year    Category        TotalSales  AverageCount
1   2013    Beverages      102074.29    22190.06
2   2013    Condiments      55277.56    14173.73
3   2013    Confections     36415.75    12138.58
4   2013    Dairy Products  30337.39    24400.00
5   2013    Seafood         53019.98    27905.25
6   2014    Beverages       81338.06    35400.00
7   2014    Condiments      55948.82    19981.72
8   2014    Confections     44478.36    24710.00
9   2014    Dairy Products  84412.36    32466.00
10  2014    Seafood         65544.19    14565.37

I computed the cumulative sum of TotalSales, grouped by Year, as follows:

dat <- within(dat, {
  # columns of dat are visible inside within(), so no dat$ prefix is needed
  RunningTotal <- ave(TotalSales, Year, FUN = cumsum)
})
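
For readers unfamiliar with `ave`, here is a minimal sketch of what the call above does, using made-up toy vectors (`sales` and `year` are hypothetical names, not columns of `dat`): `cumsum` is applied separately within each level of the grouping vector, and the result comes back in the original row order.

```r
# Toy vectors for illustration only (not the data from the question)
sales <- c(1, 2, 3, 4)
year  <- c("a", "a", "b", "b")

# ave() splits sales by year, applies cumsum to each piece,
# and puts the results back in the original positions
running <- ave(sales, year, FUN = cumsum)
print(running)  # 1 3 3 7
```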

The output is:

    Year    Category        TotalSales AverageCount RunningTotal
1   2013    Beverages       102074.29   22190.06    102074.29
2   2013    Condiments      55277.56    14173.73    157351.85
3   2013    Confections     36415.75    12138.58    193767.60
4   2013    Dairy Products  30337.39    24400.00    224104.99
5   2013    Seafood         53019.98    27905.25    277124.97
6   2014    Beverages       81338.06    35400.00    81338.06
7   2014    Condiments      55948.82    19981.72    137286.88
8   2014    Confections     44478.36    24710.00    181765.24
9   2014    Dairy Products  84412.36    32466.00    266177.60
10  2014    Seafood         65544.19    14565.37    331721.79

How can I compute, within each group, the ratio between consecutive elements of RunningTotal (that is, RunningTotal[i+1] / RunningTotal[i])?

I tried using mutate from dplyr:

require(dplyr)
dat<-mutate(dat, Ratio = lag(RunningTotal)/RunningTotal)

but I get the wrong output (note the NAs):

    Year    Category       TotalSales AverageCount  RunningTotal Ratio
1   2013    Beverages       102074.29   22190.06    102074.29   NA
2   2013    Condiments      55277.56    14173.73    157351.85   0.6487009
3   2013    Confections     36415.75    12138.58    193767.60   0.8120648
4   2013    Dairy Products  30337.39    24400.00    224104.99   0.8646287
5   2013    Seafood         53019.98    27905.25    277124.97   0.8086784
6   2014    Beverages       81338.06    35400.00    81338.06    NA
7   2014    Condiments      55948.82    19981.72    137286.88   0.5924678
8   2014    Confections     44478.36    24710.00    181765.24   0.7552978
9   2014    Dairy Products  84412.36    32466.00    266177.60   0.6828720
10  2014    Seafood         65544.19    14565.37    331721.79   0.8024122

How can I get the desired output shown below?

Year    Category       TotalSales AverageCount RunningTotal    Ratio
2013    Beverages       102074.29   22190.06    102074.29   1.5415424393
2013    Condiments      55277.56    14173.73    157351.85   1.2314288011
2013    Confections     36415.75    12138.58    193767.6    1.1565658552
2013    Dairy Products  30337.39    24400       224104.99   1.2365854504
2013    Seafood         53019.98    27905.25    277124.97   0.2935067887
2014    Beverages       81338.06    35400       81338.06    1.6878553533
2014    Condiments      55948.82    19981.72    137286.88   1.3239811408
2014    Confections     44478.36    24710       181765.24   1.4644032049
2014    Dairy Products  84412.36    32466       266177.6    1.2462423209
2014    Seafood         65544.19    14565.37    331721.79   0

Sample data:

dat <- structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
2014L, 2014L, 2014L, 2014L), Category = structure(c(1L, 2L, 3L, 
4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("Beverages", "Condiments", 
"Confections", "Dairy Products", "Seafood"), class = "factor"), 
    TotalSales = c(102074.29, 55277.56, 36415.75, 30337.39, 53019.98, 
    81338.06, 55948.82, 44478.36, 84412.36, 65544.19), AverageCount = c(22190.06, 
    14173.73, 12138.58, 24400, 27905.25, 35400, 19981.72, 24710, 
    32466, 14565.37)), .Names = c("Year", "Category", "TotalSales", 
"AverageCount"), class = "data.frame", row.names = c(NA, -10L
))

1 answer:

Answer 0 (score: 1)

The dplyr way of performing the first operation is:

dat <- dat %>% 
  group_by(Year) %>% 
  mutate(RunningTotal = cumsum(TotalSales)) %>% 
  ungroup()

Then, to add the ratio, use

dat %>% 
  mutate(Ratio = c(RunningTotal[-1] / RunningTotal[-n()], 0))

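That said, I would be tempted to make the last value NA rather than 0. The ratio for 2013 Seafood (0.2935067887) does not make sense either: it divides the first 2014 running total (81338.06) by the last 2013 one (277124.97), so it crosses the year boundary. To avoid that, do not ungroup before computing the ratio. Something like this: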

dat %>% 
  group_by(Year) %>% 
  mutate(
    RunningTotal = cumsum(TotalSales),
    Ratio = c(RunningTotal[-1] / RunningTotal[-n()], NA)
  )
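
As an alternative sketch (not part of the original answer), the same grouped ratio can probably be written more compactly with dplyr's `lead()`, which shifts a vector one position up and pads the end with NA. The `toy` data frame below is made up for illustration and is not the asker's data.

```r
library(dplyr)

# Made-up toy data (illustrative values only)
toy <- data.frame(
  Year       = c(2013L, 2013L, 2013L, 2014L, 2014L),
  TotalSales = c(10, 30, 40, 5, 15)
)

res <- toy %>%
  group_by(Year) %>%
  mutate(
    RunningTotal = cumsum(TotalSales),               # 10 40 80 | 5 20
    # next running total divided by the current one, within each Year;
    # the last row of every group has no "next" value, so it becomes NA
    Ratio = lead(RunningTotal) / RunningTotal        # 4 2 NA | 4 NA
  ) %>%
  ungroup()

print(res)
```

Because `lead()` is evaluated per group here, no ratio crosses a year boundary. If a trailing 0 is really wanted instead of NA, as in the desired output of the question, `lead(RunningTotal, default = 0)` should give that.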