I would like to count the number of items over a period of time, using their start and end dates.
Some sample data:
START <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
END <- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
df <- data.frame(START,END)
df
which gives
START END
1 2014-01-01 2014-01-04
2 2014-01-02 2014-01-03
3 2014-01-03 2014-01-03
4 2014-01-03 2014-01-04
A table showing the count of these items over time (based on their start and end dates) would look like this:
DATETIME COUNT
2014-01-01 1
2014-01-02 2
2014-01-03 4
2014-01-04 2
Can this be done in R, in particular with dplyr? Many thanks.
Answer 0 (score: 6)
This does it. You can rename the columns as needed.
as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
# Var1 Freq
# 1 2014-01-01 1
# 2 2014-01-02 2
# 3 2014-01-03 4
# 4 2014-01-04 2
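As noted in the comments, Var1 in the solution above is now a factor, not a Date. To keep the Date class in the first column, you can either post-process the result above or use plyr::count instead of as.data.frame(table(...)):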
library(plyr)
count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
# x freq
# 1 2014-01-01 1
# 2 2014-01-02 2
# 3 2014-01-03 4
# 4 2014-01-04 2
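The "post-process the result" route mentioned above could look roughly like the following base R sketch. The names all_days and counts, and the column names DATETIME and COUNT, are chosen here for illustration and are not part of the original answer.

```r
# Sample data from the question
START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
END   <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
df <- data.frame(START, END)

# Expand every interval into its individual days, then tabulate them
all_days <- Reduce(c, Map(seq, df$START, df$END, by = 1))
counts <- as.data.frame(table(all_days), stringsAsFactors = FALSE)

# Rename the columns and turn the date labels back into Date objects
names(counts) <- c("DATETIME", "COUNT")
counts$DATETIME <- as.Date(as.character(counts$DATETIME))

str(counts)
```

Going through as.character() keeps the conversion safe whether the first column arrives as a factor or as a character vector.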
Answer 1 (score: 2)
You could use data.table:
library(data.table)
DT <- setDT(df)[, list(DATETIME= seq(START, END, by=1)), by=1:nrow(df)][,
list(COUNT=.N), by=DATETIME]
DT
# DATETIME COUNT
#1: 2014-01-01 1
#2: 2014-01-02 2
#3: 2014-01-03 4
#4: 2014-01-04 2
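From version 1.9.4 on, you can also use the function foverlaps() to perform an "overlap join". It is more efficient because it does not have to expand the dates of every row before counting. Here is how: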
require(data.table) ## 1.9.4
setDT(df) ## convert your data.frame to data.table by reference
## 1. Some preprocessing:
# create a lookup - the dates for which you need the count, and set key
dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
lookup = data.table(START=dates, END=dates, key=c("START", "END"))
## 2. Now find overlapping coordinates
# for each row in `df` get all the rows it overlaps with in `lookup`
ans = foverlaps(df, lookup, type="any", which=TRUE)
Now we just need to group by yid (= the row index in lookup) and count:
## 3. count
ans[, .N, by=yid]
# yid N
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 2
The first column corresponds to the row number in lookup. If a row number is missing from the result, the count for that date is 0.
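To turn the yid indices back into dates and make the zero counts explicit, one possible continuation is sketched below. It rebuilds the objects from the answer so it can run on its own; the names counts and result are illustrative assumptions, and the !is.na(yid) filter is an added precaution for rows of df that overlap no lookup date.

```r
library(data.table)

# Sample data and lookup table, as in the answer above
df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

# Overlap join, then count matches per lookup row
ans    <- foverlaps(df, lookup, type = "any", which = TRUE)
counts <- ans[!is.na(yid), .N, by = yid]

# Start from a count of 0 for every date, then fill in the observed counts
result <- data.table(DATETIME = dates, COUNT = 0L)
result[counts$yid, COUNT := counts$N]
result
```

Dates that no interval covers keep their initial count of 0 instead of disappearing from the output.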
Answer 2 (score: 1)
Using dplyr and grouped data:
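library(dplyr)
# note: data_frame() and as_data_frame() are deprecated in newer versions;
# tibble() and as_tibble() are their replacements
data_frame(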
START = as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03")),
END = as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
) -> df
rbind(cbind(group = 'a', df),cbind(group = 'b', df)) %>% as_data_frame->df
df
df %>%
group_by(.,group) %>%
do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
This is a common problem when you want to find, for each user, the number of logins on different pages, machines, etc. over a time interval.
> df
Source: local data frame [8 x 3]
group START END
(chr) (date) (date)
1 a 2014-01-01 2014-01-04
2 a 2014-01-02 2014-01-03
3 a 2014-01-03 2014-01-03
4 a 2014-01-03 2014-01-04
5 b 2014-01-01 2014-01-04
6 b 2014-01-02 2014-01-03
7 b 2014-01-03 2014-01-03
8 b 2014-01-03 2014-01-04
>
> df %>%
+ group_by(.,group) %>%
+ do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
Source: local data frame [8 x 3]
Groups: group [2]
group Var1 Freq
(chr) (fctr) (int)
1 a 2014-01-01 1
2 a 2014-01-02 2
3 a 2014-01-03 4
4 a 2014-01-04 2
5 b 2014-01-01 1
6 b 2014-01-02 2
7 b 2014-01-03 4
8 b 2014-01-04 2
Answer 3 (score: 0)
Using dplyr and foreach:
library(dplyr)
library(foreach)
df <- data.frame(START = as.Date(c("2014-01-01",
"2014-01-02",
"2014-01-03",
"2014-01-03")),
END = as.Date(c("2014-01-04",
"2014-01-03",
"2014-01-03",
"2014-01-04")))
df
r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
.combine = rbind) %do% {
df %>%
filter(DATETIME >= START & DATETIME <= END) %>%
summarise(DATETIME, COUNT = n())
}
r
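The same idea (for each day in the overall range, count the intervals that contain it) can also be sketched in base R without extra packages. This is an illustrative alternative rather than part of the answer; the names days and r2 are made up for the example.

```r
# Sample data from the question
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Every day between the earliest start and the latest end
days <- seq(min(df$START), max(df$END), by = 1)

# For each day, count the rows whose [START, END] interval contains it
r2 <- data.frame(
  DATETIME = days,
  COUNT = vapply(
    seq_along(days),
    function(i) sum(df$START <= days[i] & days[i] <= df$END),
    integer(1)
  )
)
r2
```

Like the foreach version, this compares every day against every row, so the work grows with the number of days times the number of rows; days covered by no interval simply get a count of 0.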
Answer 4 (score: 0)
I have just posted another solution, based on lubridate, in a more recent related SO post here. It is faster for larger data frames with wide date ranges.