R - Count the number of items over time using start and end dates

Time: 2014-10-10 00:58:15

Tags: r duration dplyr

I want to count the number of items over a period of time, using their start and end dates.

Some sample data:

START <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
END <- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
df <- data.frame(START,END)
df

which gives

       START        END
1 2014-01-01 2014-01-04
2 2014-01-02 2014-01-03
3 2014-01-03 2014-01-03
4 2014-01-03 2014-01-04

A table showing the count of these items over time (based on their start and end dates) would look like this:

DATETIME    COUNT
2014-01-01   1 
2014-01-02   2 
2014-01-03   4 
2014-01-04   2 

Can this be done in R, ideally with dplyr? Many thanks.

5 answers:

Answer 0 (score: 6)

This would do it. You can change the column names as needed.

as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
#         Var1 Freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2

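As noted in the comments, Var1 in the solution above is a factor, not a date. To keep the Date class in the first column, you can do a little more work on the solution above, or use plyr::count in place of as.data.frame(table(...)):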

library(plyr)
count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
#            x freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2
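
To stay in base R and still end up with Date values, one option is to convert the tabulated labels back with as.Date. The following is a minimal sketch on the question's sample data; the names days and counts are illustrative and are not part of the original answer.

```r
START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
END   <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
df <- data.frame(START, END)

# one Date per day covered by each item, concatenated into a single vector
days <- Reduce(c, Map(seq, df$START, df$END, by = 1))

# tabulate, keeping the labels as character rather than factor
counts <- as.data.frame(table(days), stringsAsFactors = FALSE)
names(counts) <- c("DATETIME", "COUNT")

# the labels are ISO-formatted dates, so they convert back to Date directly
counts$DATETIME <- as.Date(counts$DATETIME)
str(counts)
```

This also renames the columns to match the layout requested in the question.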

Answer 1 (score: 2)

You could use data.table:

library(data.table)
# expand each row into one row per day, then count the rows for each day
DT <- setDT(df)[, list(DATETIME = seq(START, END, by = 1)), by = 1:nrow(df)][,
                list(COUNT = .N), by = DATETIME]
DT
#      DATETIME COUNT
# 1: 2014-01-01     1
# 2: 2014-01-02     2
# 3: 2014-01-03     4
# 4: 2014-01-04     2

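From version 1.9.4 onward, you can also use the function foverlaps() to perform an "overlap join". It is more efficient because it does not have to expand the dates for every row before counting. Here's how: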

require(data.table) ## 1.9.4
setDT(df) ## convert your data.frame to data.table by reference

## 1. Some preprocessing:
# create a lookup - the dates for which you need the count, and set key
dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
lookup = data.table(START=dates, END=dates, key=c("START", "END"))

## 2. Now find overlapping coordinates 
# for each row in `df` get all the rows it overlaps with in `lookup`
ans = foverlaps(df, lookup, type="any", which=TRUE)

Now we just have to group by yid (= the row index in lookup) and count:

## 3. count
ans[, .N, by=yid]
#    yid N
# 1:   1 1
# 2:   2 2
# 3:   3 4
# 4:   4 2

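The first column corresponds to the row numbers in lookup. If a row number is missing from the result, the count for that date is 0.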
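
To turn the yid indices back into dates and make the zero counts explicit, the result can be written into a table that has one row per lookup date. This is a minimal sketch: the lookup range is extended to 2014-01-06 purely as an assumption, so that two dates have no overlapping items, and the names counts and result are illustrative.

```r
library(data.table)

df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# assumption: the lookup range runs two days past the data, to show zero counts
dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

# pairs of (row in df, row in lookup) whose intervals overlap
ans <- foverlaps(df, lookup, by.x = c("START", "END"), type = "any", which = TRUE)

# count per lookup row, ignoring rows of df that matched nothing (yid is NA)
counts <- ans[!is.na(yid), .N, by = yid]

# start every date at 0, then fill in the dates that had overlaps
result <- data.table(DATETIME = dates, COUNT = 0L)
result[counts$yid, COUNT := counts$N]
result
```

Dates that never appear as a yid keep their initial count of 0.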

Answer 2 (score: 1)

Using dplyr, with grouped data:

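library(dplyr)

# tibble() and as_tibble() replace the deprecated data_frame() and as_data_frame()
df <- tibble(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
# duplicate the sample data under two group labels, "a" and "b"
df <- as_tibble(rbind(cbind(group = "a", df), cbind(group = "b", df)))
df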

df %>% 
  group_by(.,group) %>% 
  do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))

This is a common problem when you want to count logins on different pages, machines, etc. within a time interval for each user.

> df
Source: local data frame [8 x 3]

  group      START        END
  (chr)     (date)     (date)
1     a 2014-01-01 2014-01-04
2     a 2014-01-02 2014-01-03
3     a 2014-01-03 2014-01-03
4     a 2014-01-03 2014-01-04
5     b 2014-01-01 2014-01-04
6     b 2014-01-02 2014-01-03
7     b 2014-01-03 2014-01-03
8     b 2014-01-03 2014-01-04
> 
> df %>% 
+   group_by(.,group) %>% 
+   do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
Source: local data frame [8 x 3]
Groups: group [2]

  group       Var1  Freq
  (chr)     (fctr) (int)
1     a 2014-01-01     1
2     a 2014-01-02     2
3     a 2014-01-03     4
4     a 2014-01-04     2
5     b 2014-01-01     1
6     b 2014-01-02     2
7     b 2014-01-03     4
8     b 2014-01-04     2
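
The same grouped count can also be sketched without do() and table(), which keeps the date column as a Date rather than a factor. This assumes dplyr 1.1.0 or later, where reframe() is available; the names counts, DATETIME and COUNT are chosen to match the question and are not part of the original answer.

```r
library(dplyr)

df <- tibble(
  group = rep(c("a", "b"), each = 4),
  START = as.Date(rep(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"), 2)),
  END   = as.Date(rep(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"), 2))
)

counts <- df %>%
  rowwise(group) %>%                               # treat each row separately, keeping group
  reframe(DATETIME = seq(START, END, by = 1)) %>%  # one output row per day the item covers
  count(group, DATETIME, name = "COUNT")           # number of items per group and day
counts
```

As in the answer above, every row is expanded to one row per day, so this may be slow for long intervals.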

Answer 3 (score: 0)

Using dplyr and foreach:

library(dplyr)
library(foreach)

df <- data.frame(START = as.Date(c("2014-01-01",
                                   "2014-01-02",
                                   "2014-01-03",
                                   "2014-01-03")),
                 END = as.Date(c("2014-01-04",
                                 "2014-01-03",
                                 "2014-01-03",
                                 "2014-01-04")))
df

# for each day in the overall range, count the rows whose interval covers it
r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
             .combine = rbind) %do% {
  df %>%
    filter(DATETIME >= START & DATETIME <= END) %>%
    summarise(DATETIME = DATETIME, COUNT = n())
}
r
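
The same per-day filter can be sketched in base R, without foreach or dplyr. The names dates and r2 below are illustrative and are not part of the original answer.

```r
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# every day between the earliest start and the latest end
dates <- seq(min(df$START), max(df$END), by = 1)

# for each day, count the items whose [START, END] interval contains it
r2 <- data.frame(
  DATETIME = dates,
  COUNT = vapply(
    seq_along(dates),
    function(i) sum(df$START <= dates[i] & dates[i] <= df$END),
    integer(1)
  )
)
r2
```

Because it checks every day against every row, this does work proportional to the number of days times the number of rows, but days with no items appear with a count of 0.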

Answer 4 (score: 0)

I just came up with another solution, based on lubridate, that is faster for larger data frames with wide date ranges. It is in a more recent related SO post here.