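New variable by date range in R

Time: 2018-08-20 00:16:34

Tags: r algorithm

I am trying to create a new variable based on customer_id and date. The table is a log of all customer contacts, so customer IDs repeat. What I want is a sequential count per customer, derived from the contact dates and a window of x days. Every customer's first contact gets 1; if the gap since the previous contact is greater than x days, that contact gets 2, and so on. In other words, I am trying to create a "Journey" variable.

Thanks for any guidance.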

The code for the data is below:

structure(list(Customer = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    Start_dt = c("2018-04-30 13:47:13", "2018-05-03 09:22:25", 
    "2018-04-22 10:45:33", "2018-04-20 09:55:51", "2018-04-21 14:20:33", 
    "2018-05-01 15:27:43", "2018-03-28 11:25:45", "2018-04-28 10:30:35", 
    "2018-05-17 11:08:51", "2018-06-02 10:38:38"), End_dt = c("2018-04-30 14:22:15", 
    "2018-05-03 10:05:32", "2018-04-22 11:00:35", "2018-04-20 09:57:45", 
    "2018-04-21 14:27:14", "2018-05-01 16:03:25", "2018-03-28 11:35:54", 
    "2018-04-28 11:02:17", "2018-05-17 12:32:18", "2018-06-02 11:08:29"
    ), Journey = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 3L, 4L)), class = "data.frame", row.names = c(NA, 
-10L))

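2 answers:

Answer 0 (score: 1)

See the algorithm below. It converts the character vectors to Date objects and then splits the data.frame by the factor column. Inside the lapply function, it uses the zlag function (from the TSA package) to check the condition that identifies a new journey. Finally, it uses do.call to bind the per-customer data frames back together.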

df <- structure(list(Customer = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    Start_dt = structure(c(17651, 17654, 17643, 17641, 17642, 
    17652, 17618, 17649, 17668, 17684), class = "Date"), End_dt = structure(c(17651, 
    17654, 17643, 17641, 17642, 17652, 17618, 17649, 17668, 17684
    ), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
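# library() attaches only one package per call, so load each one separately
library(lubridate) # as_date()
library(TSA)       # zlag()
df$Start_dt <- as_date(df$Start_dt)
df$End_dt <- as_date(df$End_dt)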

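x <- 10 # threshold in days: a gap of x days or more starts a new journey

y <- lapply(
  X = split(df, df$Customer), 
  FUN = function(dfx) {
    # previous contact date within the customer (NA for the first contact)
    dfx$lagged <- as_date(zlag(dfx$Start_dt))
    # gap to the previous contact in days
    dfx$dt <- dfx$Start_dt - dfx$lagged
    # 1 marks the start of a new journey, 0 continues the current one
    dfx$dt <- ifelse(dfx$dt < x, 0, 1)
    dfx$dt[1] <- 1 # the first contact always starts journey 1
    dfx$Journey <- cumsum(dfx$dt)
    # keep the original columns plus Journey; drop the helpers lagged and dt
    dfx[, c("Customer", "Start_dt", "End_dt", "Journey")]
})
z <- do.call(rbind, y) # bind the per-customer pieces back together
rownames(z) <- NULL
z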

Output:

   Customer   Start_dt     End_dt Journey
1         A 2018-04-30 2018-04-30       1
2         A 2018-05-03 2018-05-03       1
3         B 2018-04-22 2018-04-22       1
4         C 2018-04-20 2018-04-20       1
5         C 2018-04-21 2018-04-21       1
6         C 2018-05-01 2018-05-01       2
7         D 2018-03-28 2018-03-28       1
8         D 2018-04-28 2018-04-28       2
9         D 2018-05-17 2018-05-17       3
10        D 2018-06-02 2018-06-02       4
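
The lag-and-subtract step can also be written without the TSA dependency. As a minimal sketch in base R (not part of the original answer), diff() gives the gap to the previous contact and ave() applies the cumsum() per customer. It assumes the rows can be sorted by customer and date, and it rebuilds only the Customer and Start_dt columns of the example data:

```r
# Example data from the question, reduced to the two columns used here.
df <- data.frame(
  Customer = c("A", "A", "B", "C", "C", "C", "D", "D", "D", "D"),
  Start_dt = as.Date(c("2018-04-30", "2018-05-03", "2018-04-22", "2018-04-20",
                       "2018-04-21", "2018-05-01", "2018-03-28", "2018-04-28",
                       "2018-05-17", "2018-06-02"))
)

x <- 10 # a gap of x days or more starts a new journey

# Sort so that diff() compares consecutive contacts of the same customer.
df <- df[order(df$Customer, df$Start_dt), ]

# Within each customer: the first contact counts as 1, every later contact
# adds 1 whenever its gap to the previous contact is at least x days.
df$Journey <- ave(
  as.numeric(df$Start_dt), df$Customer,
  FUN = function(d) cumsum(c(1, diff(d) >= x))
)

df
```

With x <- 10 this gives the same Journey values as the output above. Whether the comparison should be >= or > depends on how "within x days" is meant, so treat the operator as an assumption to adjust.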

Answer 1 (score: 1)

For the sake of completeness, here is a simplified data.table version of Artem's cumsum() approach:

library(data.table)
library(lubridate)
x <- 9
setDT(df)[, journey := 1 + cumsum(shift(End_dt, fill = End_dt[1]) + days(x) < Start_dt), 
          by = Customer]

df

    Customer            Start_dt              End_dt journey
 1:        A 2018-04-30 13:47:13 2018-04-30 14:22:15       1
 2:        A 2018-05-03 09:22:25 2018-05-03 10:05:32       1
 3:        B 2018-04-22 10:45:33 2018-04-22 11:00:35       1
 4:        C 2018-04-20 09:55:51 2018-04-20 09:57:45       1
 5:        C 2018-04-21 14:20:33 2018-04-21 14:27:14       1
 6:        C 2018-05-01 15:27:43 2018-05-01 16:03:25       2
 7:        D 2018-03-28 11:25:45 2018-03-28 11:35:54       1
 8:        D 2018-04-28 10:30:35 2018-04-28 11:02:17       2
 9:        D 2018-05-17 11:08:51 2018-05-17 12:32:18       3
10:        D 2018-06-02 10:38:38 2018-06-02 11:08:29       4

Data

Initially, the OP posted only a screenshot of the data. I tried to convert the image with the help of an online OCR converter, which worked quite well for both datetime columns.
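
To run the data.table code on the dput() data from the question, the character timestamps first have to be converted to date-times, because the comparison adds days(x) to End_dt. A minimal sketch of that preparation step follows. The time zone UTC is an assumption (the question does not state one), and the data frame is retyped here from the question's values rather than taken from this answer:

```r
library(data.table)
library(lubridate)

# Data as posted in the question: timestamps are plain character strings.
df <- data.frame(
  Customer = c("A", "A", "B", "C", "C", "C", "D", "D", "D", "D"),
  Start_dt = c("2018-04-30 13:47:13", "2018-05-03 09:22:25",
               "2018-04-22 10:45:33", "2018-04-20 09:55:51",
               "2018-04-21 14:20:33", "2018-05-01 15:27:43",
               "2018-03-28 11:25:45", "2018-04-28 10:30:35",
               "2018-05-17 11:08:51", "2018-06-02 10:38:38"),
  End_dt = c("2018-04-30 14:22:15", "2018-05-03 10:05:32",
             "2018-04-22 11:00:35", "2018-04-20 09:57:45",
             "2018-04-21 14:27:14", "2018-05-01 16:03:25",
             "2018-03-28 11:35:54", "2018-04-28 11:02:17",
             "2018-05-17 12:32:18", "2018-06-02 11:08:29"),
  stringsAsFactors = FALSE
)

setDT(df)

# Convert both columns to POSIXct. Assumption: UTC, no time zone is given.
cols <- c("Start_dt", "End_dt")
df[, (cols) := lapply(.SD, as.POSIXct, tz = "UTC"), .SDcols = cols]

# A new journey starts when a contact begins more than x days after the
# end of the previous contact of the same customer.
x <- 9
df[, journey := 1 + cumsum(shift(End_dt, fill = End_dt[1]) + days(x) < Start_dt),
   by = Customer]

print(df)
```

Note that this version measures the gap from the end of the previous contact to the start of the next one, whereas the first answer compares start dates only.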