Question

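I'm trying to build a churn model that includes the maximum number of consecutive UX failures for each customer, and I'm running into trouble. Here is my simplified data and the desired output: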

library(dplyr)
df <- data.frame(customerId = c(1,2,2,3,3,3), date = c('2015-01-01','2015-02-01','2015-02-02', '2015-03-01','2015-03-02','2015-03-03'),isFailure = c(0,0,1,0,1,1))
> df
  customerId       date isFailure
1          1 2015-01-01         0
2          2 2015-02-01         0
3          2 2015-02-02         1
4          3 2015-03-01         0
5          3 2015-03-02         1
6          3 2015-03-03         1

Desired result:

> desired.df
  customerId maxConsecutiveFailures
1          1                      0
2          2                      1
3          3                      2

Poking around and searching other questions hasn't quite helped me. This is roughly what I'm "expecting" a solution to look like:

df %>% 
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = 
    max(rle(isFailure[isFailure == 1])$lengths))

Answer 1

We group by 'customerId' and use do to run rle on the 'isFailure' column. We extract the lengths of the runs whose values are TRUE (lengths[values]) and create the 'Max' column with an if/else condition, so that customers without any 1 value get 0.

df %>%
  group_by(customerId) %>%
  do({
    tmp <- with(rle(.$isFailure == 1), lengths[values])
    data.frame(customerId = .$customerId,
               Max = if (length(tmp) == 0) 0 else max(tmp))
  }) %>%
  slice(1L)
#   customerId Max
# 1          1   0
# 2          2   1
# 3          3   2
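The same rle idea can also be sketched with summarise instead of do. This is only a sketch under a few assumptions: max_failure_run is a hypothetical helper name (not part of dplyr), isFailure contains no NA values, and rows should be ordered by date within each customer, which is why arrange is added.

```r
library(dplyr)

df <- data.frame(
  customerId = c(1, 2, 2, 3, 3, 3),
  date = c("2015-01-01", "2015-02-01", "2015-02-02",
           "2015-03-01", "2015-03-02", "2015-03-03"),
  isFailure = c(0, 0, 1, 0, 1, 1)
)

# Hypothetical helper: length of the longest run of 1s in x.
max_failure_run <- function(x) {
  runs <- rle(x == 1)
  # Keep only the runs of TRUE (failures); the extra 0 makes max()
  # return 0 instead of -Inf when there are no failure runs at all.
  max(runs$lengths[runs$values], 0)
}

result <- df %>%
  arrange(customerId, date) %>%   # assumption: runs are defined in date order
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = max_failure_run(isFailure))

print(result)
```

Passing 0 as an extra argument to max plays the role of the if/else branch above: a customer with no failures ends up with 0 rather than a warning and -Inf.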


Answer 2

Here is my attempt, using only standard dplyr functions:

df %>% 
  # grouping key(s):
  group_by(customerId) %>%
  # check if there is any value change
  # if yes, a new sequence id is generated through cumsum
  mutate(last_one = lag(isFailure, 1, default = 100), 
         not_eq = last_one != isFailure, 
         seq = cumsum(not_eq)) %>% 
  # the following is just to find the largest sequence
  count(customerId, isFailure, seq) %>% 
  group_by(customerId, isFailure) %>% 
  summarise(max_consecutive_event = max(n))

Output:

# A tibble: 5 x 3
# Groups:   customerId [3]
  customerId isFailure max_consecutive_event
       <dbl>     <dbl>                 <int>
1          1         0                     1
2          2         0                     1
3          2         1                     1
4          3         0                     1
5          3         1                     2

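A final filter on the isFailure value gives the desired result (customers with no failures then need to be added back with a count of 0, though).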
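One possible way to do that last step is sketched below. It assumes the run table is rebuilt in a slightly condensed form of the pipeline above, and the names runs, failures and result are just illustrative. The zero-failure customers are restored with a left join against the distinct customer ids.

```r
library(dplyr)

df <- data.frame(customerId = c(1, 2, 2, 3, 3, 3),
                 isFailure  = c(0, 0, 1, 0, 1, 1))

# Longest run per customer and per isFailure value (condensed pipeline)
runs <- df %>%
  group_by(customerId) %>%
  mutate(seq = cumsum(isFailure != lag(isFailure, default = 100))) %>%
  count(customerId, isFailure, seq) %>%
  group_by(customerId, isFailure) %>%
  summarise(max_consecutive_event = max(n), .groups = "drop")

# Keep only the failure runs
failures <- runs %>%
  filter(isFailure == 1) %>%
  select(customerId, maxConsecutiveFailures = max_consecutive_event)

# Add back customers without any failure and give them a count of 0
result <- df %>%
  distinct(customerId) %>%
  left_join(failures, by = "customerId") %>%
  mutate(maxConsecutiveFailures = coalesce(maxConsecutiveFailures, 0L))

print(result)
```

The left join keeps every customer from the original data, and coalesce turns the resulting NA for customers without failures into 0.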

The script works for any value in the isFailure column and computes the maximum number of consecutive days with the same value.

Summarizing consecutive failures with dplyr and rle
