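Complex cumulative sum with two reset conditions

Time: 2018-08-20 18:31:47

Tags: r dplyr cumsum

I am trying to follow some rules about when to group data for a chart. How would I get from this data frame: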

# A tibble: 11 x 8
   assay      year   qtr invalid valid total_assays    hfr predicted_inv
   <chr>     <dbl> <dbl>   <dbl> <dbl>        <dbl>  <dbl>         <dbl>
 1 test_case 2016.    1.      2.   36.          38. 0.0350         1.33 
 2 test_case 2016.    2.      1.   34.          35. 0.0350         1.23 
 3 test_case 2016.    3.      0.   25.          25. 0.0350         0.875
 4 test_case 2016.    4.      2.   23.          25. 0.0350         0.875
 5 test_case 2017.    1.      1.   29.          30. 0.0350         1.05 
 6 test_case 2017.    2.      2.   24.          26. 0.0350         0.910
 7 test_case 2017.    3.      0.   23.          23. 0.0350         0.805
 8 test_case 2017.    4.      1.   20.          21. 0.0350         0.735
 9 test_case 2018.    1.      2.   33.          35. 0.0350         1.23 
10 test_case 2018.    2.      5.   28.          33. 0.0350         1.16 
11 test_case 2018.    3.      4.    9.          13. 0.0350         0.455

to this:

       assay year qtr invalid valid total_assays   hfr predicted_inv co_inv co_val co_prd_inv trend
1  test_case 2016   1       2    36           38 0.035         1.330      2     36      1.330    No
2  test_case 2016   2       1    34           35 0.035         1.225      3     70      2.555    No
3  test_case 2016   3       0    25           25 0.035         0.875      3     95      3.430    No
4  test_case 2016   4       2    23           25 0.035         0.875      5    118      4.305   Yes
5  test_case 2017   1       1    29           30 0.035         1.050      1     29      1.050    No
6  test_case 2017   2       2    24           26 0.035         0.910      3     53      1.960    No
7  test_case 2017   3       0    23           23 0.035         0.805      3     76      2.765    No
8  test_case 2017   4       1    20           21 0.035         0.735      4     96      3.500    No
9  test_case 2018   1       2    33           35 0.035         1.225      6    129      4.725   Yes
10 test_case 2018   2       5    28           33 0.035         1.155      5     28      1.155   Yes
11 test_case 2018   3       4     9           13 0.035         0.455      4      9      0.455    No

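The rules are simple. For each row, if the cumulative sum of invalid or of predicted_inv is 5 or greater, trend is "Yes" and the cumulative sums of all three parameters (invalid, valid, predicted_inv) are reset, starting over from the next row. The resulting groupings (co_*) will then be used for trend analysis.

I have tried dplyr solutions, but I keep running into errors when I try to create several interdependent variables at the same time.

Right now I am attempting a custom function that takes just the 3 parameters as vectors, but I keep being forced into building loops... I would prefer an easy-to-read dplyr solution.

Here is the dput output: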
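To make the rule concrete, here is a minimal base R sketch that applies it row by row with an explicit loop. The function name `reset_cumsum` and its `threshold` argument are hypothetical names introduced for illustration; the three input vectors are copied from the sample data.

```r
# Sample columns copied from the question's data frame.
invalid       <- c(2, 1, 0, 2, 1, 2, 0, 1, 2, 5, 4)
valid         <- c(36, 34, 25, 23, 29, 24, 23, 20, 33, 28, 9)
predicted_inv <- c(1.33, 1.225, 0.875, 0.875, 1.05, 0.91,
                   0.805, 0.735, 1.225, 1.155, 0.455)

# Hypothetical helper: running sums that restart on the row after
# either the invalid or the predicted_inv sum reaches the threshold.
reset_cumsum <- function(invalid, valid, predicted_inv, threshold = 5) {
  n <- length(invalid)
  co_inv <- co_val <- co_prd_inv <- numeric(n)
  trend <- character(n)
  run <- c(0, 0, 0)  # running sums: invalid, valid, predicted_inv
  for (i in seq_len(n)) {
    run <- run + c(invalid[i], valid[i], predicted_inv[i])
    co_inv[i]     <- run[1]
    co_val[i]     <- run[2]
    co_prd_inv[i] <- run[3]
    hit <- run[1] >= threshold || run[3] >= threshold
    trend[i] <- if (hit) "Yes" else "No"
    if (hit) run <- c(0, 0, 0)  # reset takes effect from the next row
  }
  data.frame(co_inv, co_val, co_prd_inv, trend, stringsAsFactors = FALSE)
}

res <- reset_cumsum(invalid, valid, predicted_inv)
```

On the sample data, `res` reproduces the co_inv, co_val, co_prd_inv and trend columns of the desired output, including the reset after row 9, where the invalid sum reaches 6.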

egdf1 <- structure(list(
  assay = c("test_case", "test_case", "test_case", "test_case", "test_case", "test_case",
            "test_case", "test_case", "test_case", "test_case", "test_case"),
  year = c(2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2018, 2018, 2018),
  qtr = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3),
  invalid = c(2, 1, 0, 2, 1, 2, 0, 1, 2, 5, 4),
  valid = c(36, 34, 25, 23, 29, 24, 23, 20, 33, 28, 9),
  total_assays = c(38, 35, 25, 25, 30, 26, 23, 21, 35, 33, 13),
  hfr = c(0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035),
  predicted_inv = c(1.33, 1.225, 0.875, 0.875, 1.05, 0.91, 0.805, 0.735, 1.225, 1.155, 0.455)
), .Names = c("assay", "year", "qtr", "invalid", "valid", "total_assays", "hfr", "predicted_inv"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L))

egdf2 <- structure(list(
  assay = c("test_case", "test_case", "test_case", "test_case", "test_case", "test_case",
            "test_case", "test_case", "test_case", "test_case", "test_case"),
  year = c(2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
  qtr = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L),
  invalid = c(2L, 1L, 0L, 2L, 1L, 2L, 0L, 1L, 2L, 5L, 4L),
  valid = c(36L, 34L, 25L, 23L, 29L, 24L, 23L, 20L, 33L, 28L, 9L),
  total_assays = c(38L, 35L, 25L, 25L, 30L, 26L, 23L, 21L, 35L, 33L, 13L),
  hfr = c(0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035, 0.035),
  predicted_inv = c(1.33, 1.225, 0.875, 0.875, 1.05, 0.91, 0.805, 0.735, 1.225, 1.155, 0.455),
  co_inv = c(2L, 3L, 3L, 5L, 1L, 3L, 3L, 4L, 6L, 5L, 4L),
  co_val = c(36L, 70L, 95L, 118L, 29L, 53L, 76L, 96L, 129L, 28L, 9L),
  co_prd_inv = c(1.33, 2.555, 3.43, 4.305, 1.05, 1.96, 2.765, 3.5, 4.725, 1.155, 0.455),
  trend = c("No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No")
), .Names = c("assay", "year", "qtr", "invalid", "valid", "total_assays",
              "hfr", "predicted_inv", "co_inv", "co_val", "co_prd_inv", "trend"),
class = "data.frame", row.names = c(NA, -11L))

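2 answers:

Answer 0: (score: 1)

Use the cumsumbinning function from the MESS package, setting the threshold that the cumulative sum within a group must not exceed (5 in your example). Keep in mind that at row 9, because adding 2 to 4 crosses the threshold of 5, a new group is created on that row, whereas in your desired output the reset happens on the following row.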

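library(dplyr)  # provides %>%, group_by(), mutate(), lag()
library(MESS)   # provides cumsumbinning()

egdf1 %>%
  # bin rows so that the running sum of invalid stays within the threshold of 5
  group_by(group = cumsumbinning(invalid, 5)) %>%
  mutate(
    # cumulative sums restart within each bin
    co_inv = cumsum(invalid),
    co_val = cumsum(valid),
    co_prd_inv = cumsum(predicted_inv),
    # lag() runs within each group, so this flags the first row of every bin after the first
    trend = ifelse(group - lag(group, default = 0) > 1, "yes", "no")
  )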

Output

   assay      year   qtr invalid valid total_assays   hfr predicted_inv group co_inv co_val co_prd_inv trend
   <chr>     <dbl> <dbl>   <dbl> <dbl>        <dbl> <dbl>         <dbl> <int>  <dbl>  <dbl>      <dbl> <chr>
 1 test_case  2016     1       2    36           38 0.035         1.33      1      2     36      1.33  no   
 2 test_case  2016     2       1    34           35 0.035         1.23      1      3     70      2.56  no   
 3 test_case  2016     3       0    25           25 0.035         0.875     1      3     95      3.43  no   
 4 test_case  2016     4       2    23           25 0.035         0.875     1      5    118      4.30  no   
 5 test_case  2017     1       1    29           30 0.035         1.05      2      1     29      1.05  yes  
 6 test_case  2017     2       2    24           26 0.035         0.91      2      3     53      1.96  no   
 7 test_case  2017     3       0    23           23 0.035         0.805     2      3     76      2.76  no   
 8 test_case  2017     4       1    20           21 0.035         0.735     2      4     96      3.5   no   
 9 test_case  2018     1       2    33           35 0.035         1.23      3      2     33      1.23  yes  
10 test_case  2018     2       5    28           33 0.035         1.16      4      5     28      1.16  yes  
11 test_case  2018     3       4     9           13 0.035         0.455     5      4      9      0.455 yes 

Answer 1: (score: 1)

A base R solution using Reduce:
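The answer's own code is not preserved here. What follows is only a minimal sketch of how `Reduce` with `accumulate = TRUE` could carry the running sums and a reset flag from row to row; it is an assumption about the general approach, not the answerer's method. The names `step`, `rows`, `states` and `threshold` are hypothetical, and the input vectors are copied from the question's sample data.

```r
# Sample columns copied from the question's data frame.
invalid       <- c(2, 1, 0, 2, 1, 2, 0, 1, 2, 5, 4)
valid         <- c(36, 34, 25, 23, 29, 24, 23, 20, 33, 28, 9)
predicted_inv <- c(1.33, 1.225, 0.875, 0.875, 1.05, 0.91,
                   0.805, 0.735, 1.225, 1.155, 0.455)
threshold <- 5

# One step of the fold: start from zero if the previous row hit the
# threshold, otherwise continue from the previous running sums.
step <- function(state, row) {
  start <- if (state$hit) c(0, 0, 0) else state$sums
  sums <- start + row
  list(sums = sums, hit = sums[1] >= threshold || sums[3] >= threshold)
}

# One length-3 vector (invalid, valid, predicted_inv) per row.
rows <- Map(c, invalid, valid, predicted_inv)

# accumulate = TRUE keeps every intermediate state; [-1] drops the initial one.
states <- Reduce(step, rows, accumulate = TRUE,
                 init = list(sums = c(0, 0, 0), hit = FALSE))[-1]

co <- do.call(rbind, lapply(states, `[[`, "sums"))
colnames(co) <- c("co_inv", "co_val", "co_prd_inv")
trend <- ifelse(vapply(states, `[[`, logical(1), "hit"), "Yes", "No")

result <- data.frame(co, trend = trend, stringsAsFactors = FALSE)
```

The state is a small list rather than a single number because the reset decision for a row depends on whether the previous row crossed the threshold, so the flag has to travel along with the sums. `result` could be attached to the original data with `cbind(egdf1, result)`.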