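I have a large data frame that can be grouped by customer ID (ID). Each ID has several visit dates (VisitingTime). If the gap between two consecutive visits of the same ID is more than 45 days, I want to treat that as the start of a new project. I then need to find the start and end date of every project within each ID. Below is my code for finding the start and end dates, but what would a more idiomatic version of this code look like in R?
(x is a sample customer record.) For example, for the following customer: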
x:
ID VisitingTime
2 Customer_001 2011-09-01
3 Customer_001 2011-09-22
4 Customer_001 2011-10-25
5 Customer_001 2011-11-29
6 Customer_001 2011-12-20
7 Customer_001 2012-01-13
8 Customer_001 2012-02-03
9 Customer_001 2012-02-24
10 Customer_001 2013-07-24
11 Customer_001 2013-08-08
12 Customer_001 2013-08-29
13 Customer_001 2013-09-12
14 Customer_001 2013-10-03
15 Customer_001 2013-10-24
I need:
> start
[1] "2011-09-01" "2013-07-24"
> end
[1] "2012-02-24" "2013-10-24"
My code:
start <- x[1,2]
end <- x[nrow(x),2]
for (i in 1:(nrow(x)-1)){
  if (difftime(x[i+1,2], x[i,2] , units = "days") > 45){
    end <- c(x[i,2],end)
    start <- c(start ,x[i+1,2])
  }
}
dput(x)
structure(list(ID = c("Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001"), VisitingTime = structure(c(1314835200,
1316649600, 1319500800, 1322524800, 1324339200, 1326412800, 1328227200,
1330041600, 1374624000, 1375920000, 1377734400, 1378944000, 1380758400,
1382572800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("ID",
"VisitingTime"), row.names = 2:15, class = "data.frame")
Answer 0 (score: 2)
I'd use the following dplyr one-liner:
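library(dplyr)
# Flag a new visit whenever the gap to the previous record exceeds 45 days,
# then number the visits within each customer with cumsum().
x %>% group_by(ID) %>%
  mutate(visit = cumsum(c(Inf, as.numeric(diff(VisitingTime), units = "days")) > 45)) %>%
  group_by(ID, visit) %>%
  summarise(end = max(VisitingTime), start = min(VisitingTime))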
This produces the data frame:
ID visit end start
1 Customer_001 1 2012-02-24 2011-09-01
2 Customer_001 2 2013-10-24 2013-07-24
3 Customer_002 1 2012-02-24 2011-09-01
4 Customer_002 2 2013-10-24 2013-07-24
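Note that I tested it on a data frame with more than one customer ID to make sure the first grouping step works.
How does it work? Well, start with your data and do the following, printing x after each step: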
x$DT = c(Inf, diff(x$VisitingTime))
x$begin = x$DT>45
x$visit = cumsum(x$begin)
You should see that x$visit numbers the records so that all rows belonging to the same visit share one group value.
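The three steps above can be run end to end on the sample customer from the question. This is only a minimal sketch: it assumes the rows are already sorted by time, the explicit conversion to days and the duplicated() lookups for the first and last row of each group are my own additions, and gap_days is just an illustrative name.

```r
# Rebuild the question's sample visits for Customer_001 (dates copied from the post).
x <- data.frame(
  ID = "Customer_001",
  VisitingTime = as.POSIXct(c(
    "2011-09-01", "2011-09-22", "2011-10-25", "2011-11-29", "2011-12-20",
    "2012-01-13", "2012-02-03", "2012-02-24", "2013-07-24", "2013-08-08",
    "2013-08-29", "2013-09-12", "2013-10-03", "2013-10-24"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Gap to the previous record in days; Inf makes the first row start a visit.
gap_days <- c(Inf, as.numeric(diff(x$VisitingTime), units = "days"))

# TRUE marks the first record of a visit; cumsum() turns that into a visit number.
visit <- cumsum(gap_days > 45)

# With sorted rows, the first/last row of each visit group gives start/end.
start <- x$VisitingTime[!duplicated(visit)]
end   <- x$VisitingTime[!duplicated(visit, fromLast = TRUE)]

format(start, "%Y-%m-%d")  # "2011-09-01" "2013-07-24"
format(end, "%Y-%m-%d")    # "2012-02-24" "2013-10-24"
```

The only gap above 45 days in this sample is the jump from 2012-02-24 to 2013-07-24, so visit is 1 for the first eight rows and 2 for the remaining six.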
The one-liner does all of this with dplyr and then goes on to get the minimum and maximum dates within each visit group.
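As a further test I just checked what happens if I set the difference threshold to 1 day, in which case I get one visit per record, and if I set it to more than 9000 days, in which case I get only a single visit record. (I also fixed a silly bug where I was calling max for the start date and vice versa.)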
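Those threshold checks are easy to repeat if the cutoff is a parameter. Here is a rough base R sketch of that idea, not the dplyr one-liner itself: visit_spans and max_gap_days are hypothetical names I made up for illustration, and the two-customer test data simply reuses the question's dates for a second ID.

```r
# Hypothetical helper: one row per (ID, visit) with its start and end time.
visit_spans <- function(df, max_gap_days = 45) {
  # Sort so that each customer's records are contiguous and in time order.
  df <- df[order(df$ID, df$VisitingTime), ]
  # Per-ID visit counter: a gap above the threshold starts a new visit.
  # as.numeric() on POSIXct gives seconds, so divide by 86400 to get days.
  df$visit <- ave(as.numeric(df$VisitingTime), df$ID, FUN = function(t) {
    cumsum(c(Inf, diff(t) / 86400) > max_gap_days)
  })
  key <- paste(df$ID, df$visit)
  first <- !duplicated(key)                  # first record of each visit
  last  <- !duplicated(key, fromLast = TRUE) # last record of each visit
  data.frame(ID = df$ID[first], visit = df$visit[first],
             start = df$VisitingTime[first], end = df$VisitingTime[last])
}

# Test data: the question's dates, duplicated for a second (made-up) customer.
dates <- c("2011-09-01", "2011-09-22", "2011-10-25", "2011-11-29", "2011-12-20",
           "2012-01-13", "2012-02-03", "2012-02-24", "2013-07-24", "2013-08-08",
           "2013-08-29", "2013-09-12", "2013-10-03", "2013-10-24")
visits <- data.frame(
  ID = rep(c("Customer_001", "Customer_002"), each = length(dates)),
  VisitingTime = as.POSIXct(rep(dates, 2), tz = "UTC"),
  stringsAsFactors = FALSE
)

nrow(visit_spans(visits, 45))    # 4: two visits per customer
nrow(visit_spans(visits, 1))     # 28: every record is its own visit
nrow(visit_spans(visits, 9000))  # 2: one visit per customer
```

Because the visit counter is computed inside ave() per ID, the gap between the last record of one customer and the first record of the next never influences the grouping.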