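I am trying to create a new variable based on customer_id and date. The table is a log of all customer contacts, so customer IDs repeat. What I want is a new variable that holds a sequential count per customer, derived from the contact dates and a window of x days. Every customer's first contact would be 1; if the gap since the previous contact is greater than x days, the contact would be 2, and so on. In effect, I am trying to create a "journey" variable.
Any guidance is appreciated.
The data is below: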
structure(list(Customer = structure(c(1L, 1L, 2L, 3L, 3L, 3L,
4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Start_dt = c("2018-04-30 13:47:13", "2018-05-03 09:22:25",
"2018-04-22 10:45:33", "2018-04-20 09:55:51", "2018-04-21 14:20:33",
"2018-05-01 15:27:43", "2018-03-28 11:25:45", "2018-04-28 10:30:35",
"2018-05-17 11:08:51", "2018-06-02 10:38:38"), End_dt = c("2018-04-30 14:22:15",
"2018-05-03 10:05:32", "2018-04-22 11:00:35", "2018-04-20 09:57:45",
"2018-04-21 14:27:14", "2018-05-01 16:03:25", "2018-03-28 11:35:54",
"2018-04-28 11:02:17", "2018-05-17 12:32:18", "2018-06-02 11:08:29"
), Journey = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 3L, 4L)), class = "data.frame", row.names = c(NA,
-10L))
Answer 0 (score: 1)
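The algorithm below converts the character vectors to Date objects and then splits the data.frame by the factor column. Inside the lapply function, it uses the zlag function (from the TSA package) to check the condition that identifies a new journey. Finally, it uses do.call with rbind to concatenate the data frames.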
df <- structure(list(Customer = structure(c(1L, 1L, 2L, 3L, 3L, 3L,
4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Start_dt = structure(c(17651, 17654, 17643, 17641, 17642,
17652, 17618, 17649, 17668, 17684), class = "Date"), End_dt = structure(c(17651,
17654, 17643, 17641, 17642, 17652, 17618, 17649, 17668, 17684
), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
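library(lubridate)
library(TSA)  # provides zlag(); library(lubridate, TSA) would not attach TSA
df$Start_dt <- as_date(df$Start_dt)
df$End_dt <- as_date(df$End_dt)
x <- 10 # 10 days
y <- lapply(
  X = split(df, df$Customer),
  FUN = function(dfx) {
    # previous contact date within this customer (NA for the first row)
    dfx$lagged <- as_date(zlag(dfx$Start_dt))
    dfx$dt <- dfx$Start_dt - dfx$lagged
    # 1 marks the start of a new journey (gap of at least x days)
    dfx$dt <- ifelse(dfx$dt < x, 0, 1)
    dfx$dt[1] <- 1
    dfx$Journey <- cumsum(dfx$dt)
    # drop the helper columns by name rather than by position
    dfx[, !(names(dfx) %in% c("lagged", "dt"))]
  })
z <- do.call(rbind, y)
rownames(z) <- NULL
z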
Output:
Customer Start_dt End_dt Journey
1 A 2018-04-30 2018-04-30 1
2 A 2018-05-03 2018-05-03 1
3 B 2018-04-22 2018-04-22 1
4 C 2018-04-20 2018-04-20 1
5 C 2018-04-21 2018-04-21 1
6 C 2018-05-01 2018-05-01 2
7 D 2018-03-28 2018-03-28 1
8 D 2018-04-28 2018-04-28 2
9 D 2018-05-17 2018-05-17 3
10 D 2018-06-02 2018-06-02 4
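The split, lag and cumsum steps above can also be sketched in base R without the TSA dependency. This is only an illustration under a few assumptions: the helper name journey_id is hypothetical, the data is rebuilt here with Start_dt only, and a gap of at least x days is treated as the start of a new journey (the same rule as the ifelse() call above).

```r
# Rebuild the example data (Start_dt only) so the sketch is self-contained.
df <- data.frame(
  Customer = c("A", "A", "B", "C", "C", "C", "D", "D", "D", "D"),
  Start_dt = as.Date(c("2018-04-30", "2018-05-03", "2018-04-22",
                       "2018-04-20", "2018-04-21", "2018-05-01",
                       "2018-03-28", "2018-04-28", "2018-05-17",
                       "2018-06-02"))
)

x <- 10  # gap threshold in days

# Contacts must be in chronological order within each customer.
df <- df[order(df$Customer, df$Start_dt), ]

# Hypothetical helper: d is a numeric vector of day numbers for one customer.
journey_id <- function(d, max_gap) {
  gaps <- c(Inf, diff(d))   # Inf forces the first contact to open journey 1
  cumsum(gaps >= max_gap)   # each sufficiently large gap opens a new journey
}

# ave() applies the helper per customer and returns a vector in row order.
df$Journey <- ave(as.numeric(df$Start_dt), df$Customer,
                  FUN = function(d) journey_id(d, x))
df
```

Using ave() keeps the rows in their original order, so no rbind step is needed afterwards.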
Answer 1 (score: 1)
For the sake of completeness, here is a streamlined data.table version of Artem's cumsum() approach:
library(data.table)
library(lubridate)
x <- 9
# assumes Start_dt and End_dt are date or date-time objects, not character
setDT(df)[, journey := 1 + cumsum(shift(End_dt, fill = End_dt[1]) + days(x) < Start_dt), by = Customer]
df
Originally, the OP posted only a screenshot of the data. I tried converting the image with an online OCR converter, which worked quite well for both datetime columns:
Customer Start_dt End_dt journey
1: A 2018-04-30 13:47:13 2018-04-30 14:22:15 1
2: A 2018-05-03 09:22:25 2018-05-03 10:05:32 1
3: B 2018-04-22 10:45:33 2018-04-22 11:00:35 1
4: C 2018-04-20 09:55:51 2018-04-20 09:57:45 1
5: C 2018-04-21 14:20:33 2018-04-21 14:27:14 1
6: C 2018-05-01 15:27:43 2018-05-01 16:03:25 2
7: D 2018-03-28 11:25:45 2018-03-28 11:35:54 1
8: D 2018-04-28 10:30:35 2018-04-28 11:02:17 2
9: D 2018-05-17 11:08:51 2018-05-17 12:32:18 3
10: D 2018-06-02 10:38:38 2018-06-02 11:08:29 4
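The data.table call above needs real date-time columns, whereas the dput() output in the question stores Start_dt and End_dt as character. A minimal sketch of the missing conversion step follows; the UTC time zone is an assumption (the question does not state one), and a base R difftime is used here in place of lubridate's days() to keep the dependencies down.

```r
library(data.table)

# Character timestamps as they appear in the question's dput() output.
df <- data.frame(
  Customer = c("A", "A", "B", "C", "C", "C", "D", "D", "D", "D"),
  Start_dt = c("2018-04-30 13:47:13", "2018-05-03 09:22:25",
               "2018-04-22 10:45:33", "2018-04-20 09:55:51",
               "2018-04-21 14:20:33", "2018-05-01 15:27:43",
               "2018-03-28 11:25:45", "2018-04-28 10:30:35",
               "2018-05-17 11:08:51", "2018-06-02 10:38:38"),
  End_dt   = c("2018-04-30 14:22:15", "2018-05-03 10:05:32",
               "2018-04-22 11:00:35", "2018-04-20 09:57:45",
               "2018-04-21 14:27:14", "2018-05-01 16:03:25",
               "2018-03-28 11:35:54", "2018-04-28 11:02:17",
               "2018-05-17 12:32:18", "2018-06-02 11:08:29"),
  stringsAsFactors = FALSE
)

# Parse to POSIXct; tz = "UTC" is an assumption, not stated in the question.
df$Start_dt <- as.POSIXct(df$Start_dt, tz = "UTC")
df$End_dt   <- as.POSIXct(df$End_dt, tz = "UTC")

x <- 9
gap <- as.difftime(x, units = "days")

# A new journey starts when the previous contact ended more than x days
# before the current contact started.
setDT(df)[, journey := 1L + cumsum(shift(End_dt, fill = End_dt[1]) + gap < Start_dt),
          by = Customer]
df
```

The rows must already be sorted by Start_dt within each customer, as they are in the sample data; otherwise an order step (for example setorder(df, Customer, Start_dt)) is needed first.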