Modifying a specific group

Time: 2018-05-04 19:59:44

Tags: r dplyr

Consider the following dummy dataset:

library(dplyr)
df <- structure(list(x = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 7L, 
                                     1L, 2L, 3L, 4L, 5L, 6L, 7L, 7L), 
                                   .Label = c("1", "2", "3", "4", 
                                              "5", "6", "Total"), class = "factor"), 
                     y = structure(c(1L, 1L, 
                                     2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 
                                   .Label = c("7", "8", "9", "Total"), class = "factor"), 
                     z = structure(c(1L, 2L, 
                                     1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), 
                                   .Label = c("10", "11"), class = "factor"), 
                     count = c(56, 89, 12, 119, 3, 2, 71, 
                               210, 22, 64, 53, 0, 136, 11, 211, 75), 
                     date = structure(c(17866, 
                                        17866, 17866, 17866, 17866, 17866, 17866, 17866, 17501, 17501, 
                                        17501, 17501, 17501, 17501, 17501, 17501), class = "Date")), 
                class = "data.frame", 
                row.names = c(NA, -16L), 
                .Names = c("x", "y", "z", "count", "date")) %>%
  filter(count != 0)

> df
       x     y  z count       date
1      1     7 10    56 2018-12-01
2      2     7 11    89 2018-12-01
3      3     8 10    12 2018-12-01
4      4     8 11   119 2018-12-01
5      5     9 10     3 2018-12-01
6      6     9 11     2 2018-12-01
7  Total Total 10    71 2018-12-01
8  Total Total 11   210 2018-12-01
9      1     7 10    22 2017-12-01
10     2     7 11    64 2017-12-01
11     3     8 10    53 2017-12-01
12     5     9 10   136 2017-12-01
13     6     9 11    11 2017-12-01
14 Total Total 10   211 2017-12-01
15 Total Total 11    75 2017-12-01

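I am interested in calculating the year-over-year (YoY) percentage change, with a slight modification.

Here is the unmodified version (close to what I want, but not quite):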

df_yoy <- df %>%
  group_by(x, y, z) %>%
  summarize(YoY = count[date == max(date)]/count[date == min(date)] - 1) %>%
  as.data.frame()

> df_yoy
      x     y  z        YoY
1     1     7 10  1.5454545
2     2     7 11  0.3906250
3     3     8 10 -0.7735849
4     4     8 11  0.0000000
5     5     9 10 -0.9779412
6     6     9 11 -0.8181818
7 Total Total 10 -0.6635071
8 Total Total 11  1.8000000 <-- obtained by doing 210/75-1
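
The 0.0000000 in row 4 is not a measured change. Once the zero-count row has been filtered out, the group x == 4, y == 8, z == 11 has only one date, so max(date) and min(date) coincide and the ratio is 119/119. A minimal sketch of how such groups could be spotted, using a trimmed copy of the z == 11 rows (df11 and single_date are illustrative names, not part of the original post):

```r
library(dplyr)

# Trimmed copy of the z == 11 rows printed above (illustrative subset only).
df11 <- data.frame(
  x = c("2", "4", "6", "Total", "2", "6", "Total"),
  y = c("7", "8", "9", "Total", "7", "9", "Total"),
  z = "11",
  count = c(89, 119, 2, 210, 64, 11, 75),
  date = as.Date(rep(c("2018-12-01", "2017-12-01"), times = c(4, 3)))
)

# Keep only the groups that were observed on a single date.
single_date <- df11 %>%
  group_by(x, y, z) %>%
  filter(n_distinct(date) == 1) %>%
  ungroup()

single_date  # one row: x == 4, y == 8, z == 11, count == 119
```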

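Note how I have specifically called out the last row. Here are my requirements:

  1. The count values must remain unchanged.
  2. The count for x == 4 & y == 8 & z == 11 was not measured on 2017-12-01. Therefore, when calculating the YoY percentage change for the Total rows, the count where x == 4 & y == 8 & z == 11 needs to be excluded from the numerator, count[date == max(date)].
  3. Therefore, here is the output I am looking for: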

    > df_yoy
          x     y  z        YoY
    1     1     7 10  1.5454545
    2     2     7 11  0.3906250
    3     3     8 10 -0.7735849
    4     4     8 11  0.0000000
    5     5     9 10 -0.9779412
    6     6     9 11 -0.8181818
    7 Total Total 10 -0.6635071
    8 Total Total 11  0.2133333 <-- obtained by doing (210-119)/75-1
    

    Note that 119, the count where x == 4 & y == 8 & z == 11, is subtracted from 210.

    Is there a way to modify summarize() to make this change? I have tried using ifelse() and case_when(), without success.
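
The arithmetic behind the requested Total row can be written out as a small helper. This is only a sketch of the calculation described in the requirements; yoy() and its exclude argument are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: YoY change, with an optional amount removed
# from the numerator (the latest-date count) before dividing.
yoy <- function(latest, earliest, exclude = 0) {
  (latest - exclude) / earliest - 1
}

yoy(210, 75)                 # 1.8, the unadjusted Total for z == 11
yoy(210, 75, exclude = 119)  # 0.2133333, the requested value
```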

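1 answer:

Answer 0 (score: 1)

A solution can be reached with dplyr by changing the grouping partway through: summarize per (x, y, z) group, regroup by z to adjust the Total rows, then group by (x, y, z) again to compute the ratio.

Note: the solution could be written more concisely, but I chose a more verbose form so that the logic is easier for the OP and other readers to follow.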

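library(dplyr)

# Note: is.na(YoYDeno) can only be TRUE if df still contains the zero-count row
# for x == 4, y == 8, z == 11 on 2017-12-01, i.e. df built as above but without
# the final filter(count != 0).
df %>% mutate(count = ifelse(count == 0, NA, count)) %>%  # treat 0 as "not measured"
  group_by(x, y, z) %>%
  summarize(YoYNume = count[date == max(date)],
            YoYDeno = count[date == min(date)]) %>%
  group_by(z) %>%
  # per z: sum the latest counts of groups that have no earlier measurement
  mutate(valueToDiscard = sum(ifelse(is.na(YoYDeno), YoYNume, 0))) %>%
  # adjust only the Total rows; every other row keeps its own numerator
  mutate(YoYNume = ifelse(x == "Total", YoYNume - valueToDiscard, YoYNume)) %>%
  group_by(x, y, z) %>%
  summarise(YoY = YoYNume/YoYDeno - 1) %>%
  as.data.frame()

#       x     y  z        YoY
# 1     1     7 10  1.5454545
# 2     2     7 11  0.3906250
# 3     3     8 10 -0.7735849
# 4     4     8 11         NA
# 5     5     9 10 -0.9779412
# 6     6     9 11 -0.8181818
# 7 Total Total 10 -0.6635071
# 8 Total Total 11  0.2133333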
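
For the filtered df from the question (where the unmeasured row is absent rather than NA), a variant along the same lines might look like the sketch below. It is not the answerer's code. It assumes the data contains exactly two dates, rebuilds the printed rows with a compact data.frame() call, and uses the illustrative names latest, earliest, discard and adj. The count column itself is left untouched.

```r
library(dplyr)

# Compact rebuild of the 15 rows printed in the question (after filtering).
df <- data.frame(
  x = c("1", "2", "3", "4", "5", "6", "Total", "Total",
        "1", "2", "3", "5", "6", "Total", "Total"),
  y = c("7", "7", "8", "8", "9", "9", "Total", "Total",
        "7", "7", "8", "9", "9", "Total", "Total"),
  z = c("10", "11", "10", "11", "10", "11", "10", "11",
        "10", "11", "10", "10", "11", "10", "11"),
  count = c(56, 89, 12, 119, 3, 2, 71, 210,
            22, 64, 53, 136, 11, 211, 75),
  date = as.Date(rep(c("2018-12-01", "2017-12-01"), times = c(8, 7)))
)

latest   <- max(df$date)
earliest <- min(df$date)

# Per z: latest-date counts of non-Total groups with no earliest-date row.
discard <- df %>%
  filter(x != "Total") %>%
  group_by(x, y, z) %>%
  filter(!any(date == earliest)) %>%
  group_by(z) %>%
  summarize(valueToDiscard = sum(count[date == latest]))

df_yoy <- df %>%
  left_join(discard, by = "z") %>%
  mutate(valueToDiscard = coalesce(valueToDiscard, 0),
         # adjusted copy of count: only the latest Total rows are reduced
         adj = ifelse(x == "Total" & date == latest,
                      count - valueToDiscard, count)) %>%
  group_by(x, y, z) %>%
  summarize(YoY = adj[date == max(date)] / adj[date == min(date)] - 1) %>%
  as.data.frame()

df_yoy
```

With this approach the x == 4 group still reports 0 (its single date serves as both numerator and denominator), which matches the output requested in the question, while the Total row for z == 11 becomes (210 - 119)/75 - 1.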