Question

I am trying to adapt the solution posted in "Create cohort dropout rate table from raw data".

I want to use this data to create a CUMULATIVE dropout rate table.

library(data.table)

DT <- data.table(
  id = c(1,2,3,4,5,6,7,8,9,10,
         11,12,13,14,15,16,17,18,19,20,
         21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
  year = c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
           2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
           2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
  cohort = c(1,1,1,1,1,1,1,1,1,1,
             2,2,2,1,1,2,1,2,1,2,
             1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))

So far, I have been able to get to this point:

library(tidyverse)

DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = `2014` - `2015`,
         y2015_2016_dropouts = `2015` - `2016`) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)

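The cumulative dropout rate table shows the proportion of a cohort's students who have dropped out by the end of each year.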

# A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                  NA                    NA
3      3                   0                   0                  NA                    NA

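The last two columns show that by the end of 2014-2015, 60% of cohort 1 students had dropped out, and by the end of 2015-2016, 80% of cohort 1 students had dropped out.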
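The 60% and 80% figures follow directly from the cohort 1 head counts in the raw data (10 students in 2014, 4 in 2015, 2 in 2016). A quick base R check, with variable names chosen here purely for illustration:

```r
# Cohort 1 head counts per year, read off the raw data.
n <- c(`2014` = 10, `2015` = 4, `2016` = 2)

# Students lost between consecutive years: 6, then 2.
dropouts <- -diff(n)

# Running total of losses as a share of the starting cohort size.
cumulative <- unname(cumsum(dropouts) / n[1])
print(cumulative)
```

The same recipe (count per year, difference, running sum, divide by the starting size) is what needs to be applied to cohorts 2 and 3.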

I would like to calculate the same thing for cohorts 2 and 3, but I don't know how to do it.

Answer 1

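Here is an alternative data.table solution that keeps your data organized in a way I find easier to work with. Using your DT input data:

Count and sort by cohort and year: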

DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]

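Label each row with the year range it covers. The lag is taken within each cohort so that a cohort's first row is not paired with the previous cohort's last year, and year is converted to character first so that the grouped assignment keeps a single column type:

DT2[, year := as.character(year)]
DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]

Dropouts per year, with 0 for a cohort's first year:

DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]

Cumulative proportion of each cohort lost by each year:

DT2[, cumul := cumsum(dropouts) / max(N), by = cohort]

Output:

> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000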
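The steps in this answer can also be wrapped into one function. The following is only a sketch: the name cumulative_dropout is hypothetical, the data is rebuilt with rep() so the block runs on its own, and it divides by the cohort's first-year size (N[1L]) rather than max(N). The two denominators agree for this data, but that is an assumption about cohorts never growing.

```r
library(data.table)

# Same data as in the question, built more compactly.
DT <- data.table(
  id = 1:35,
  year = rep(c(2014, 2015, 2016), times = c(10, 10, 15)),
  cohort = c(rep(1, 10),
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
)

# Hypothetical helper: cumulative dropout rate per cohort and year.
# Assumes one row per enrolled student per year.
cumulative_dropout <- function(dt) {
  counts <- dt[, .N, by = .(cohort, year)][order(cohort, year)]
  # Students lost since the previous year; 0 for a cohort's first year.
  counts[, dropouts := shift(N, fill = N[1L]) - N, by = cohort]
  # Running total of losses relative to the cohort's first-year size.
  counts[, cumul := cumsum(dropouts) / N[1L], by = cohort]
  counts[]
}

result <- cumulative_dropout(DT)
print(result)
```

Using shift() with a fill value avoids the ifelse() on NA, and keeping year numeric leaves it available for later joins or plots.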

Answer 2

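Because you spread the data by year early in the pipe, the 2014 column is NA for everything related to cohort 2, so you need to coalesce the denominator when calculating y2015_2016_cumulative. If you change the definition of that variable from the current

y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative

to

y2015_2016_cumulative = y2015_2016_dropouts / coalesce(`2014`, `2015`) +
  coalesce(y2014_2015_cumulative, 0)

you should be good to go. The coalesce function returns its first argument, but substitutes the second wherever the first is NA.

That said, this approach does not scale well: you would have to add another coalesce call for every year you add. If you keep your data in tidy format, you can instead keep a running tally at the cohort-year level.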
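The snippet that originally accompanied this suggestion is not preserved here, so what follows is only one possible sketch of the tidy-format idea, not the answerer's code. It assumes dplyr is installed, rebuilds the question's data as a plain data frame, and uses the hypothetical name cohort_rates for the result.

```r
library(dplyr)

# Same data as in the question, built compactly as a data frame.
DT <- data.frame(
  id = 1:35,
  year = rep(c(2014, 2015, 2016), times = c(10, 10, 15)),
  cohort = c(rep(1, 10),
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
)

cohort_rates <- DT %>%
  count(cohort, year) %>%          # one row per cohort-year, head count in n
  arrange(cohort, year) %>%
  group_by(cohort) %>%
  mutate(
    # Students lost since the previous year; 0 for a cohort's first year.
    dropouts = lag(n, default = first(n)) - n,
    # Running total of losses relative to the cohort's first-year size.
    cumulative = cumsum(dropouts) / first(n)
  ) %>%
  ungroup()

print(cohort_rates)
```

Because the data stays long, adding another year of data needs no new columns and no extra coalesce calls.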
