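I'm currently facing a very specific problem in R: I have a dataset of approximately 2.5 million rows containing event-based data about purchase journeys. The format looks like this (for simplicity I have left out most demographic variables and a few others):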
UserID PurchaseID Time of Contact Purchase Age
1 1 2015-08-07 19:16:59 0 35
1 1 2015-08-07 21:17:32 0 35
1 1 2015-08-07 22:42:51 0 35
1 1 2015-08-07 23:06:13 0 35
1 2 2016-05-26 11:01:16 1 35
1 2 2016-06-02 19:57:25 1 35
1 2 2016-06-15 15:48:20 1 35
1 2 2016-06-21 08:39:44 1 35
2 3 2015-11-14 11:32:10 0 51
2 3 2015-11-14 11:32:20 0 51
2 3 2015-11-14 11:33:50 0 51
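I want to analyze how the average time between contacts within each individual journey affects the purchase probability. To do that, I want to compute the total length of each customer journey (e.g. from the first contact of PurchaseID 1 to the last contact of PurchaseID 1). Afterwards I want to aggregate the data so that it looks like this: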
UserID PurchaseID Customer journey length Purchase Age
1 1 03:49:14 0 35
1 2 621:38:28 1 35
2 3 00:01:40 0 51
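Honestly, I have no idea where to start, so I hope you can help me. Thanks a lot!
Answer 0 (score: 1)
This should do the job (I only used a very small sample, so please test it on your data):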
library(dplyr)
library(lubridate)

# two contacts of the same journey, two hours apart
df <- data.frame(userID = c(1, 1),
                 PurchaseID = c(1, 1),
                 Contactime = c(ymd_hms("2015-08-07 19:16:59"), ymd_hms("2015-08-07 21:16:59")),
                 Purchase = c(0, 0),
                 Age = c(35, 35))

# one row per journey: last contact minus first contact, in seconds
timesummary <- df %>%
  group_by(userID, PurchaseID, Purchase, Age) %>%
  summarise(journeylength = as.numeric(difftime(max(Contactime), min(Contactime), units = "secs")))
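Note that I've given the journey length in seconds; this can be changed through the units argument of difftime (for example "mins" or "hours").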
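If the H:MM:SS layout from the question's desired output is wanted instead of raw seconds, the numeric result can be formatted afterwards. The following is only a sketch in base R; `format_hms` is a hypothetical helper name, not part of dplyr or lubridate, and it assumes non-negative durations given in seconds.

```r
# Hypothetical helper: turn a number of seconds into "HH:MM:SS".
# Hours are not wrapped at 24, so long journeys show e.g. "621:38:28".
format_hms <- function(secs) {
  secs <- round(secs)
  sprintf("%02d:%02d:%02d",
          secs %/% 3600,          # whole hours
          (secs %% 3600) %/% 60,  # remaining whole minutes
          secs %% 60)             # remaining seconds
}

# The three journey lengths from the question, expressed in seconds
formatted <- format_hms(c(13754, 2237908, 100))
print(formatted)
```

With the dplyr result above, this could be applied as `format_hms(timesummary$journeylength)`. For modelling it is probably easier to keep the numeric seconds and format only for display.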
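Answer 1 (score: 0)
Here is an alternative to the solution already provided:
# journey length in seconds: last contact minus first contact per journey
dat1 <- aggregate(. ~ PurchaseID + UserID, data = df[, 1:3], function(V) max(V) - min(V))
# summed Purchase flag and (constant) Age per journey
dat2 <- aggregate(. ~ PurchaseID + UserID, data = df[, c(1:2, 4)], sum)
dat3 <- aggregate(. ~ PurchaseID + UserID, data = df[, c(1:2, 5)], mean)
dat <- merge(merge(dat1, dat2, by = c("PurchaseID", "UserID")),
             dat3, by = c("PurchaseID", "UserID"))
# drop journeys of length 0 (a single contact); logical indexing also works when no row matches,
# whereas dat[-which(...), ] would return zero rows in that case
dat <- dat[dat$TimeofContact != 0, ]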
# some polishing
names(dat)[3] <- "CustomerJourneyLength"
# converting time differences in a more suitable format
hours <- dat$CustomerJourneyLength %/% 3600
minutes <- (dat$CustomerJourneyLength %% 3600)%/%60
seconds <- (dat$CustomerJourneyLength %% 3600)%%60
dat$CustomerJourneyLength <- paste0(hours, " hours ", minutes, " minutes ", round(seconds), " seconds")
# which yields
> dat
PurchaseID UserID CustomerJourneyLength Purchase Age
1 1 1 15 hours 28 minutes 49 seconds 1 27
2 1 2 15 hours 21 minutes 44 seconds 3 31
3 2 1 4 hours 11 minutes 17 seconds 2 27
5 3 1 9 hours 39 minutes 45 seconds 1 27
6 3 2 14 hours 36 minutes 31 seconds 1 31
Here is the data I used:
df <- data.frame(UserID = sample(1:2, 20, replace = T),
PurchaseID = sample(1:3, 20, replace = T),
TimeofContact = runif(20, Sys.time(), Sys.time() + 20*3600),
Purchase = sample(0:1, 20, replace = T),
Age = rep(NA, 20))
df$Age[which(df$UserID == 1)] <- sample(20:40, 1)
df$Age[which(df$UserID == 2)] <- sample(20:40, 1)
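The question ultimately asks about the average time between contacts, not only the total journey length. As a rough sketch of how both could be computed per journey in base R, the example below uses seven rows copied from the question's sample. The names `journeys`, `per_journey`, `JourneyLengthSecs` and `MeanGapSecs` are made up for illustration, and the timestamps are assumed to be in UTC.

```r
# Rows for PurchaseID 1 and 3 from the question's sample (timezone assumed to be UTC)
journeys <- data.frame(
  UserID = c(1, 1, 1, 1, 2, 2, 2),
  PurchaseID = c(1, 1, 1, 1, 3, 3, 3),
  TimeofContact = as.POSIXct(c(
    "2015-08-07 19:16:59", "2015-08-07 21:17:32",
    "2015-08-07 22:42:51", "2015-08-07 23:06:13",
    "2015-11-14 11:32:10", "2015-11-14 11:32:20", "2015-11-14 11:33:50"
  ), tz = "UTC")
)
journeys$secs <- as.numeric(journeys$TimeofContact)

# One row per (UserID, PurchaseID): total length and mean gap between contacts
per_journey <- do.call(rbind, lapply(
  split(journeys, list(journeys$UserID, journeys$PurchaseID), drop = TRUE),
  function(j) data.frame(
    UserID = j$UserID[1],
    PurchaseID = j$PurchaseID[1],
    JourneyLengthSecs = max(j$secs) - min(j$secs),
    MeanGapSecs = mean(diff(sort(j$secs)))  # NaN for a journey with one contact
  )
))
print(per_journey)
```

On 2.5 million rows a split/lapply loop like this will likely be slow, so it is meant to show the calculation rather than to be the fastest way to run it.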
Answer 2 (score: 0)
With data.table this will run fast.
library(data.table)
Recreating the data:
dat <-
data.table(
UserID = round(runif(1e5, 1, 1e5 / 5)),
PurchaseID = round(runif(1e5, 1, 5)),
timeOfContact = as.POSIXct(runif(1e5, 0, 2e5), origin = '2017-09-20'),
Purchase = round(runif(1e5, 0, 1)),
age = round(runif(1e5, 15, 65))
)
dat[, age := max(age), .(UserID)]
dat[, Purchase := max(Purchase), .(UserID, PurchaseID)]
A one-liner:
dat[, .(customerJourneyLength = as.numeric(difftime(
max(timeOfContact),
min(timeOfContact),
tz = 'GMT',
units = 'secs'
))), .(UserID, PurchaseID, Purchase, age)]
Also, please avoid column names that contain spaces.
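To illustrate the point about spaces, here is a small base R sketch that strips them from the headers before any further processing. The data frame `raw` is a hypothetical stand-in for the question's table; only the header names are taken from the question.

```r
# Hypothetical data frame with a header containing spaces, as in the question.
# check.names = FALSE keeps the name exactly as written.
raw <- data.frame(
  `UserID` = 1,
  `Time of Contact` = "2015-08-07 19:16:59",
  check.names = FALSE
)

# Before renaming, the column needs backticks: raw$`Time of Contact`
names(raw) <- gsub(" ", "", names(raw), fixed = TRUE)

# Afterwards it can be used like any other name: raw$TimeofContact
print(names(raw))
```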