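I need to find the average time spent in each stage from the data frame below. I am quite proficient in R, but I am not sure how to approach this task. Would dplyr be the best way to do it?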
    Client   Stage Stage.Start
1 Client A Stage 1  2017/01/01
2 Client B Stage 1  2017/03/04
3 Client C Stage 2  2017/03/10
4 Client A Stage 2  2017/02/03
5 Client A Stage 3  2017/06/01
6 Client C Stage 3  2017/09/09
Expected output:
    Client   Stage Stage.Start Stage.Duration
1 Client A Stage 1  2017/01/01        31 days
2 Client B Stage 1  2017/03/04             NA
3 Client C Stage 2  2017/03/10       180 days
4 Client A Stage 2  2017/02/03       118 days
5 Client A Stage 3  2017/06/01             NA
6 Client C Stage 3  2017/09/09             NA
Roll-up:
  Stage Avg.Duration
Stage 1      31 days
Stage 2     149 days
Stage 3           NA
Answer 0 (score: 3)
If I understand the question correctly, the code below should give the desired answer:

library(data.table)
setorder(DT, Client, Stage)
DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start,
   by = Client][, .(avg.duration = mean(duration)), by = Stage]

     Stage avg.duration
1: Stage 1        137.5
2: Stage 2        150.5
3: Stage 3        153.0
4: Stage 4         53.0

Note that I have used the current date (Sys.Date()) for stages that have not been completed yet, to avoid NA and to get the duration so far. The figures for open stages therefore depend on the day the code is run.
Alternatively,

DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][
  , .(avg.duration = mean(duration, na.rm = TRUE)), by = Stage]

will return the expected result (apart from minor differences in the figures):

     Stage avg.duration
1: Stage 1         33.0
2: Stage 2        150.5
3: Stage 3          NaN
4: Stage 4          NaN

If the data frame is properly ordered by Client and Stage, the OP's intention may become clearer:

setorder(DT, Client, Stage)
DT

   id   Client   Stage Stage.Start
1:  1 Client A Stage 1  2017-01-01
2:  4 Client A Stage 2  2017-02-03
3:  5 Client A Stage 3  2017-06-01
4:  2 Client B Stage 1  2017-03-04
5:  3 Client C Stage 2  2017-03-10
6:  6 Client C Stage 4  2017-09-09

Then the duration is computed for each client, using the current date (duration so far) by default:

DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start,
   by = Client][]

   id   Client   Stage Stage.Start duration
1:  1 Client A Stage 1  2017-01-01       33
2:  4 Client A Stage 2  2017-02-03      118
3:  5 Client A Stage 3  2017-06-01      153
4:  2 Client B Stage 1  2017-03-04      242
5:  3 Client C Stage 2  2017-03-10      183
6:  6 Client C Stage 4  2017-09-09       53

or NA:

DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][]
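The answer does not show how DT is built, so the following is a minimal, self-contained sketch of the same approach. It assumes the six rows used in the answers (with Client C's last row labelled Stage 4) and swaps Sys.Date() for a fixed, hypothetical reference date `as_of` of 2017-11-01, chosen only because it reproduces the figures shown above. The column names `duration_so_far` and `duration_closed` are illustrative, not part of the original answer.

```r
library(data.table)

# Sample data as used in the answers (assumption: Client C ends in "Stage 4").
DT <- data.table(
  id = 1:6,
  Client = c("Client A", "Client B", "Client C", "Client A", "Client A", "Client C"),
  Stage = c("Stage 1", "Stage 1", "Stage 2", "Stage 2", "Stage 3", "Stage 4"),
  Stage.Start = as.Date(c("2017/01/01", "2017/03/04", "2017/03/10",
                          "2017/02/03", "2017/06/01", "2017/09/09"),
                        format = "%Y/%m/%d")
)

# Assumption: fixed stand-in for Sys.Date() so the result is reproducible.
as_of <- as.Date("2017-11-01")

# Order rows so that each client's stages are consecutive and ascending.
setorder(DT, Client, Stage)

# Days until the same client's next stage starts; open stages run up to as_of.
DT[, duration_so_far := as.numeric(
  shift(Stage.Start, type = "lead", fill = as_of) - Stage.Start), by = Client]

# Same computation, but stages with no following stage stay NA.
DT[, duration_closed := as.numeric(
  shift(Stage.Start, type = "lead") - Stage.Start), by = Client]

# Roll up to one row per stage.
avg <- DT[, .(avg_so_far = mean(duration_so_far),
              avg_closed = mean(duration_closed, na.rm = TRUE)),
          by = Stage]
print(avg)
```

Converting the difference to plain numbers with as.numeric() keeps the averages as day counts regardless of how a given data.table version handles difftime columns inside grouped means.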
Answer 1 (score: 2)
If I understand correctly, you can solve this with dplyr::group_by() and summarise().
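
library(dplyr)   # provides %>%, group_by(), mutate(), lag(), summarise()
library(tibble)  # provides tribble()

df <- tribble(
  ~Client,    ~Stage,    ~Stage.Start,
  "Client A", "Stage 1", "2017/01/01",
  "Client B", "Stage 1", "2017/03/04",
  "Client C", "Stage 2", "2017/03/10",
  "Client A", "Stage 2", "2017/02/03",
  "Client A", "Stage 3", "2017/06/01",
  "Client C", "Stage 4", "2017/09/09"
)
df$Client <- factor(df$Client)
df$Stage <- factor(df$Stage)
# Parse the year/month/day strings into Date values
df$Stage.Start <- lubridate::ymd(df$Stage.Start)

# Within each client, compute the time elapsed since the previous stage started
lags <- df %>%
  group_by(Client) %>%
  mutate(
    lag_time = lag(Stage.Start),
    time_diff = Stage.Start - lag_time
  )

# Average those differences per stage, ignoring rows with no previous stage
mean_by_stage <- lags %>%
  group_by(Stage) %>%
  summarise(
    mean_diff = mean(time_diff, na.rm = TRUE)
  )
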
Edit - a look at the output:

lags
# A tibble: 6 x 5
# Groups:   Client [3]
    Client   Stage Stage.Start   lag_time time_diff
    <fctr>  <fctr>      <date>     <date>    <time>
1 Client A Stage 1  2017-01-01         NA   NA days
2 Client B Stage 1  2017-03-04         NA   NA days
3 Client C Stage 2  2017-03-10         NA   NA days
4 Client A Stage 2  2017-02-03 2017-01-01   33 days
5 Client A Stage 3  2017-06-01 2017-02-03  118 days
6 Client C Stage 4  2017-09-09 2017-03-10  183 days

mean_by_stage
# A tibble: 4 x 2
    Stage mean_diff
   <fctr>    <time>
1 Stage 1  NaN days
2 Stage 2   33 days
3 Stage 3  118 days
4 Stage 4  183 days
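
Because lag() is used, each interval in this output is attached to the stage that begins at its end, so Stage 2 shows the 33 days Client A spent in Stage 1. The question's expected output attaches the interval to the stage being left instead. A minimal sketch of that variant, assuming the same six rows and using dplyr::lead(); the names `durations`, `duration`, and `mean_duration` are illustrative and not from the original answer:

```r
library(dplyr)

# Same sample data as above, built directly with Date values.
df <- tibble(
  Client = c("Client A", "Client B", "Client C", "Client A", "Client A", "Client C"),
  Stage = c("Stage 1", "Stage 1", "Stage 2", "Stage 2", "Stage 3", "Stage 4"),
  Stage.Start = as.Date(c("2017/01/01", "2017/03/04", "2017/03/10",
                          "2017/02/03", "2017/06/01", "2017/09/09"),
                        format = "%Y/%m/%d")
)

durations <- df %>%
  # Sort so that lead() sees each client's stages in chronological order.
  arrange(Client, Stage.Start) %>%
  group_by(Client) %>%
  # Days from this stage's start to the start of the client's next stage;
  # NA when the client has no later stage.
  mutate(duration = as.numeric(lead(Stage.Start) - Stage.Start)) %>%
  ungroup()

# Average duration per stage, ignoring stages that have not ended yet.
mean_by_stage <- durations %>%
  group_by(Stage) %>%
  summarise(mean_duration = mean(duration, na.rm = TRUE))

print(mean_by_stage)
```

With this data, Stage 1 and Stage 2 get averages of 33 and 150.5 days, and stages that no client has left yet come out as NaN, which matches the second data.table variant in Answer 0.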