Define date groups based on gaps, then find the start and end dates within each group

Time: 2014-09-04 06:54:48

Tags: r traversal

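I have a large data frame that can be grouped by customer ID (ID). Each ID has several visit dates (VisitingTime). If the gap between two consecutive visits for an ID is more than 45 days, I want to treat the later visit as the start of a new project. I then need to find the start and end date of each project within each ID. My code for finding the start and end dates is below, but what is the more professional way to write this in R?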

(x is a sample of one customer's records.) For example, for the following customer:

x:
             ID VisitingTime
2  Customer_001   2011-09-01
3  Customer_001   2011-09-22
4  Customer_001   2011-10-25
5  Customer_001   2011-11-29
6  Customer_001   2011-12-20
7  Customer_001   2012-01-13
8  Customer_001   2012-02-03
9  Customer_001   2012-02-24
10 Customer_001   2013-07-24
11 Customer_001   2013-08-08
12 Customer_001   2013-08-29
13 Customer_001   2013-09-12
14 Customer_001   2013-10-03
15 Customer_001   2013-10-24

I need:

> start
[1] "2011-09-01" "2013-07-24"
> end
[1] "2012-02-24"  "2013-10-24"

My code:

start <- x[1,2]
end <- x[nrow(x),2]

for (i in 1:(nrow(x)-1)){
  if (difftime(x[i+1,2], x[i,2] , units = "days") >  45){
    end <- c(x[i,2],end)
    start <- c(start ,x[i+1,2])
  }  
}

dput(x)
structure(list(ID = c("Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001"), VisitingTime = structure(c(1314835200,
1316649600, 1319500800, 1322524800, 1324339200, 1326412800, 1328227200,
1330041600, 1374624000, 1375920000, 1377734400, 1378944000, 1380758400,
1382572800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("ID",
"VisitingTime"), row.names = 2:15, class = "data.frame")

1 answer:

Answer 0: (score: 2)

I would use the following dplyr one-liner:

> library(dplyr)
> x %>% group_by(ID) %>% 
  mutate(visit=cumsum(c(Inf,diff(VisitingTime))>45)) %>% 
  group_by(ID, visit) %>% summarise(end=max(VisitingTime),start=min(VisitingTime))

which produces the data frame:

            ID visit        end      start
1 Customer_001     1 2012-02-24 2011-09-01
2 Customer_001     2 2013-10-24 2013-07-24
3 Customer_002     1 2012-02-24 2011-09-01
4 Customer_002     2 2013-10-24 2013-07-24

Note that I tested it on a data frame with more than one customer ID, to make sure the first grouping step works.

How does it work? Start with your data and run the following, printing x after each step:

x$DT = c(Inf, diff(x$VisitingTime))
x$begin = x$DT>45
x$visit = cumsum(x$begin)

You should see that x$visit numbers the groups: every row belonging to the same visit group gets the same value.
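The three steps above can be sketched on a bare vector of dates, which makes the intermediate values easy to inspect. One assumption is made loudly here: the gap is converted to days explicitly, because diff() on POSIXct values picks its units automatically, and data with gaps shorter than a day could end up being compared in hours or seconds instead. The variable names are illustrative only.

```r
# Visit dates for Customer_001, copied from the question.
visits <- as.POSIXct(c("2011-09-01", "2011-09-22", "2011-10-25", "2011-11-29",
                       "2011-12-20", "2012-01-13", "2012-02-03", "2012-02-24",
                       "2013-07-24", "2013-08-08", "2013-08-29", "2013-09-12",
                       "2013-10-03", "2013-10-24"), tz = "UTC")

# Gap to the previous visit, explicitly in days (assumption: days are the
# intended unit). The first visit has no predecessor, so it gets Inf and
# therefore always opens a new group.
gap_days <- c(Inf, as.numeric(diff(visits), units = "days"))

# TRUE wherever a new group begins.
begin <- gap_days > 45

# The running count of group starts is the group number of each row.
visit <- cumsum(begin)

print(data.frame(visits, gap_days, begin, visit))
```

Printing the data frame shows the single large gap before 2013-07-24 flipping begin to TRUE, which is what bumps visit from 1 to 2.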

The one-liner does all of this with dplyr, then goes on to take the minimum and maximum dates within each visit group.
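For readers without dplyr, the same idea can be sketched in base R. This is not the code from the answer, only a rough equivalent under stated assumptions: the two-customer test data is a guess (the question's visits copied under a second ID, which is what the output table above suggests), and the names pieces and projects are made up for the example.

```r
# Visit dates for Customer_001, copied from the question.
visits <- as.POSIXct(c("2011-09-01", "2011-09-22", "2011-10-25", "2011-11-29",
                       "2011-12-20", "2012-01-13", "2012-02-03", "2012-02-24",
                       "2013-07-24", "2013-08-08", "2013-08-29", "2013-09-12",
                       "2013-10-03", "2013-10-24"), tz = "UTC")

# Assumed test data: the same visits duplicated under a second customer ID.
x <- data.frame(ID = rep(c("Customer_001", "Customer_002"), each = length(visits)),
                VisitingTime = rep(visits, 2),
                stringsAsFactors = FALSE)
x <- x[order(x$ID, x$VisitingTime), ]

# Group number per row, computed separately for each ID.
# Times are in seconds since the epoch, so dividing by 86400 gives days.
x$visit <- ave(as.numeric(x$VisitingTime), x$ID,
               FUN = function(t) cumsum(c(Inf, diff(t) / 86400) > 45))

# One summary row per (ID, visit) combination.
pieces <- split(x, list(x$ID, x$visit), drop = TRUE)
projects <- do.call(rbind, lapply(pieces, function(p) {
  data.frame(ID = p$ID[1], visit = p$visit[1],
             start = min(p$VisitingTime), end = max(p$VisitingTime),
             stringsAsFactors = FALSE)
}))
projects <- projects[order(projects$ID, projects$visit), ]
rownames(projects) <- NULL

print(projects)
```

Going through split() and rbind() rather than aggregate() is a deliberate choice in this sketch, since it keeps the start and end columns as date-times instead of risking a conversion to plain numbers.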

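As a further test, I checked what happens if I set the gap threshold to 1 day, in which case I get one visit per record, and if I set it to more than 9000 days, in which case I get a single visit. (I also fixed a silly bug where I had labelled the max date as start, and vice versa.)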
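That sanity check is easy to repeat if the threshold is made a parameter. The helper below is hypothetical (group_visits and max_gap_days are names invented for this sketch, not part of dplyr or of the answer), and it assumes the gap is meant to be measured in days.

```r
# Visit dates for Customer_001, copied from the question.
visits <- as.POSIXct(c("2011-09-01", "2011-09-22", "2011-10-25", "2011-11-29",
                       "2011-12-20", "2012-01-13", "2012-02-03", "2012-02-24",
                       "2013-07-24", "2013-08-08", "2013-08-29", "2013-09-12",
                       "2013-10-03", "2013-10-24"), tz = "UTC")

# Hypothetical helper: returns a group number for each visit, starting a new
# group whenever the gap to the previous visit exceeds max_gap_days.
group_visits <- function(times, max_gap_days = 45) {
  gap_days <- c(Inf, as.numeric(diff(times), units = "days"))
  cumsum(gap_days > max_gap_days)
}

# Count the groups produced by each threshold.
n_groups_1    <- length(unique(group_visits(visits, 1)))     # every record is its own group
n_groups_45   <- length(unique(group_visits(visits, 45)))    # the two projects from the question
n_groups_9000 <- length(unique(group_visits(visits, 9000)))  # everything in one group

print(c(n_groups_1, n_groups_45, n_groups_9000))
```

The two extreme thresholds act as boundary cases: if either of them fails to give "one group per row" or "one group in total", the comparison or its units are probably wrong.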