How do I get the average time per event from a data frame?

Time: 2017-11-01 15:12:31

Tags: r datetime dplyr

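I need to find the average time spent in each stage from the data frame below. I'm fairly competent in R, but I'm not sure how to approach this task. Would dplyr be the best way to do it?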

    Client   Stage Stage.Start
1 Client A Stage 1  2017/01/01
2 Client B Stage 1  2017/03/04
3 Client C Stage 2  2017/03/10
4 Client A Stage 2  2017/02/03
5 Client A Stage 3  2017/06/01
6 Client C Stage 3  2017/09/09

Expected output:

    Client   Stage Stage.Start Stage.Duration
1 Client A Stage 1  2017/01/01 31 days
2 Client B Stage 1  2017/03/04 NA
3 Client C Stage 2  2017/03/10 180 days
4 Client A Stage 2  2017/02/03 118 days
5 Client A Stage 3  2017/06/01 NA
6 Client C Stage 3  2017/09/09 NA

Roll-up:

Stage    Avg.Duration
Stage 1  31 days
Stage 2  149 days
Stage 3  NA

2 answers:

Answer 0 (score: 3)

If I understand the question correctly, the code below should give the desired answer:

library(data.table)
setorder(DT, Client, Stage)
DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start, by = Client][
  , .(avg.duration = mean(duration)), by = Stage]

     Stage avg.duration
1: Stage 1        137.5
2: Stage 2        150.5
3: Stage 3        153.0
4: Stage 4         53.0

Note that I have used the current date (Sys.Date()) for stages that are not yet completed, to avoid NA and to get the duration so far.

Alternatively,

DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][
  , .(avg.duration = mean(duration, na.rm = TRUE)), by = Stage]

     Stage avg.duration
1: Stage 1         33.0
2: Stage 2        150.5
3: Stage 3          NaN
4: Stage 4          NaN

will return the expected result (except for minor variations in the figures).

Explanation

The OP's intention probably becomes clearer once the data frame is properly ordered by Client and Stage:

setorder(DT, Client, Stage)
DT

   id   Client   Stage Stage.Start
1:  1 Client A Stage 1  2017-01-01
2:  4 Client A Stage 2  2017-02-03
3:  5 Client A Stage 3  2017-06-01
4:  2 Client B Stage 1  2017-03-04
5:  3 Client C Stage 2  2017-03-10
6:  6 Client C Stage 4  2017-09-09

Then the duration is computed for each client as the start of the next stage minus the start of the current one, by default using the current date for the last stage (the duration so far):

DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start, by = Client][]

   id   Client   Stage Stage.Start duration
1:  1 Client A Stage 1  2017-01-01       33
2:  4 Client A Stage 2  2017-02-03      118
3:  5 Client A Stage 3  2017-06-01      153
4:  2 Client B Stage 1  2017-03-04      242
5:  3 Client C Stage 2  2017-03-10      183
6:  6 Client C Stage 4  2017-09-09       53

or leaving it as NA:

DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][]

Data
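
The data.table snippets in this answer need a `DT` object to run against. The following is only a sketch that rebuilds it from the six rows printed in the ordered table; the `id` values and the use of `as.numeric()` to get plain day counts are assumptions made for illustration, not part of the answer's own code.

```r
library(data.table)

# Assumed reconstruction of the sample data; id follows the row order in the question
DT <- data.table(
  id = 1:6,
  Client = c("Client A", "Client B", "Client C", "Client A", "Client A", "Client C"),
  Stage = c("Stage 1", "Stage 1", "Stage 2", "Stage 2", "Stage 3", "Stage 4"),
  Stage.Start = as.Date(c("2017-01-01", "2017-03-04", "2017-03-10",
                          "2017-02-03", "2017-06-01", "2017-09-09"))
)

# Order by client and stage so that "lead" means the client's next stage
setorder(DT, Client, Stage)

# Duration in days = start of the next stage minus start of this stage (NA for the last stage)
DT[, duration := as.numeric(shift(Stage.Start, type = "lead") - Stage.Start), by = Client]

# Average per stage, ignoring stages that have no successor yet
avg <- DT[, .(avg.duration = mean(duration, na.rm = TRUE)), by = Stage]
print(avg)
```

Stages for which no client has moved on yet come out as NaN, because every duration in that group is NA.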

Answer 1 (score: 2)

If I understand you correctly, you can solve this with dplyr::group_by() and summarise().

library(dplyr)
library(tibble)

# Sample data
df <- tribble(
  ~Client,  ~Stage, ~Stage.Start,
   "Client A", "Stage 1",  "2017/01/01",
   "Client B", "Stage 1",  "2017/03/04",
   "Client C", "Stage 2",  "2017/03/10",
   "Client A", "Stage 2",  "2017/02/03",
   "Client A", "Stage 3",  "2017/06/01",
   "Client C", "Stage 4",  "2017/09/09"
)

# Convert the grouping columns to factors and parse the start dates
df$Client <- factor(df$Client)
df$Stage <- factor(df$Stage)
df$Stage.Start <- lubridate::ymd(df$Stage.Start)

# Within each client, time elapsed since the previous stage started
lags <- df %>%
  group_by(Client) %>%
  mutate(
    lag_time = lag(Stage.Start),
    time_diff = Stage.Start - lag_time
  )

# Average of those differences per stage
mean_by_stage <- lags %>%
  group_by(Stage) %>%
  summarise(
    mean_diff = mean(time_diff, na.rm = TRUE)
  )

Edit - a look at the output:

lags
# A tibble: 6 x 5
# Groups:   Client [3]
    Client   Stage Stage.Start   lag_time time_diff
    <fctr>  <fctr>      <date>     <date>    <time>
1 Client A Stage 1  2017-01-01         NA   NA days
2 Client B Stage 1  2017-03-04         NA   NA days
3 Client C Stage 2  2017-03-10         NA   NA days
4 Client A Stage 2  2017-02-03 2017-01-01   33 days
5 Client A Stage 3  2017-06-01 2017-02-03  118 days
6 Client C Stage 4  2017-09-09 2017-03-10  183 days


mean_by_stage   
# A tibble: 4 x 2
    Stage mean_diff
   <fctr>    <time>
1 Stage 1  NaN days
2 Stage 2   33 days
3 Stage 3  118 days
4 Stage 4  183 days
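
Because lag() looks back at the previous start date, the output above assigns each interval to the later stage (33 days lands on Stage 2). If the interval should instead belong to the stage that is ending, as in the question's expected output, a variant with lead() might look like the sketch below. It assumes rows should be sorted by Stage.Start within each client, and the names `duration` and `mean_duration` are chosen here for illustration.

```r
library(dplyr)

df <- data.frame(
  Client = c("Client A", "Client B", "Client C", "Client A", "Client A", "Client C"),
  Stage = c("Stage 1", "Stage 1", "Stage 2", "Stage 2", "Stage 3", "Stage 4"),
  Stage.Start = as.Date(c("2017/01/01", "2017/03/04", "2017/03/10",
                          "2017/02/03", "2017/06/01", "2017/09/09"),
                        format = "%Y/%m/%d")
)

mean_by_stage <- df %>%
  group_by(Client) %>%
  # Sort chronologically within each client so lead() really is the next stage
  arrange(Stage.Start, .by_group = TRUE) %>%
  # Days until the client's next stage starts; NA for the client's last stage
  mutate(duration = as.numeric(lead(Stage.Start) - Stage.Start)) %>%
  group_by(Stage) %>%
  summarise(mean_duration = mean(duration, na.rm = TRUE))

print(mean_by_stage)
```

With this variant, Stage 1 averages 33 days and Stage 2 averages 150.5 days, while stages that no client has left yet come out as NaN.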