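I want the sum() of the counts (rows) that fall between specific dates. I found some solutions on Stack Overflow, but the catch is that my second data frame is much larger than the first.
Dataset one
dim(foo1) # 600 / 2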
Start End
2017-10-24 22:33:59 2017-10-24 22:43:59
2017-11-13 06:34:59 2017-11-13 06:44:59
2017-11-13 06:52:00 2017-11-13 07:02:00
2017-11-13 07:16:59 2017-11-13 07:26:59
2017-11-13 07:35:59 2017-11-13 07:45:59
Dataset two
dim(foo2) # 60,000 / 2
Count Time
1 2017-10-01 13:45:02
1 2017-10-01 12:53:23
1 2017-10-01 12:20:56
1 2017-10-01 12:31:12
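I want the sum of all rows (Count) in foo2 whose Time falls between the Start and End of each row in foo1. The result should be foo1 plus a new_column containing that sum.
This is my initial "solution", which I could not get to work: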
for(i in 1:nrow(foo1)){
foo1$new_column[i] <-sum(foo2$Count[which(
foo2$Time >= foo2$Start[i] &
foo2$Time <= foo2$End[i])])
}
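Answer 0 (score: 1)
There seems to be a problem with your sample data: the Time values in foo2 (all on 2017-10-01) do not fall within any of the intervals in foo1 (which start at 2017-10-24).
So I created my own sample data.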
library(data.table)
foo1 <- data.table( Start = c("2017-10-24 22:33:59", "2017-11-13 06:34:59", "2017-11-13 06:52:00", "2017-11-13 07:16:59", "2017-11-13 07:35:59"),
End = c("2017-10-24 22:43:59", "2017-11-13 06:44:59", "2017-11-13 07:02:00", "2017-11-13 07:26:59", "2017-11-13 07:45:59"),
stringsAsFactors = FALSE)
# Start End
# 1: 2017-10-24 22:33:59 2017-10-24 22:43:59
# 2: 2017-11-13 06:34:59 2017-11-13 06:44:59
# 3: 2017-11-13 06:52:00 2017-11-13 07:02:00
# 4: 2017-11-13 07:16:59 2017-11-13 07:26:59
# 5: 2017-11-13 07:35:59 2017-11-13 07:45:59
foo2 <- data.table( Count = c(1,1,1,1),
Time = c("2017-10-24 22:37:02", "2017-10-24 22:38:23", "2017-11-13 07:20:56", "2017-10-01 12:31:12"),
stringsAsFactors = FALSE)
# Count Time
# 1: 1 2017-10-24 22:37:02
# 2: 1 2017-10-24 22:38:23
# 3: 1 2017-11-13 07:20:56
# 4: 1 2017-10-01 12:31:12
#set times as POSIXct
foo1[, Start := as.POSIXct(Start, format = "%Y-%m-%d %H:%M:%S")]
foo1[, End := as.POSIXct(End, format = "%Y-%m-%d %H:%M:%S")]
foo2[, Time := as.POSIXct(Time, format = "%Y-%m-%d %H:%M:%S")]
#add a dummy-column so each Time becomes a zero-length interval (start == end)
foo2[, dummy := Time]
#set data.table keys
setkey(foo1, Start, End)
setkey(foo2, Time, dummy)
#overlap-join, lose the dummy-column
foo3 <- foverlaps(foo2, foo1, type = "within", mult = "first", nomatch = 0L)[, dummy := NULL]
# Start End Count Time
# 1: 2017-10-24 22:33:59 2017-10-24 22:43:59 1 2017-10-24 22:37:02
# 2: 2017-10-24 22:33:59 2017-10-24 22:43:59 1 2017-10-24 22:38:23
# 3: 2017-11-13 07:16:59 2017-11-13 07:26:59 1 2017-11-13 07:20:56
foo3[, sum(Count), by = "Start"]
# Start V1
# 1: 2017-10-24 22:33:59 2
# 2: 2017-11-13 07:16:59 1
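The foverlaps result above only lists intervals with at least one match, while the question asks for foo1 plus a new column. One possible way to get that shape is a data.table non-equi join with by = .EACHI. This is a minimal sketch, not part of the original answer. It assumes data.table 1.9.8 or later, parses the timestamps as UTC (an assumption), and borrows the column name new_column from the question.

```r
library(data.table)

# Same sample data as above, parsed directly as POSIXct (UTC is an assumption)
foo1 <- data.table(
  Start = as.POSIXct(c("2017-10-24 22:33:59", "2017-11-13 06:34:59", "2017-11-13 06:52:00",
                       "2017-11-13 07:16:59", "2017-11-13 07:35:59"), tz = "UTC"),
  End   = as.POSIXct(c("2017-10-24 22:43:59", "2017-11-13 06:44:59", "2017-11-13 07:02:00",
                       "2017-11-13 07:26:59", "2017-11-13 07:45:59"), tz = "UTC")
)
foo2 <- data.table(
  Count = c(1, 1, 1, 1),
  Time  = as.POSIXct(c("2017-10-24 22:37:02", "2017-10-24 22:38:23",
                       "2017-11-13 07:20:56", "2017-10-01 12:31:12"), tz = "UTC")
)

# Non-equi join: for every foo1 row, sum the Count of the foo2 rows whose
# Time lies in [Start, End]. Unmatched intervals yield NA, which
# sum(..., na.rm = TRUE) turns into 0.
counts <- foo2[foo1, on = .(Time >= Start, Time <= End),
               .(new_column = sum(Count, na.rm = TRUE)), by = .EACHI]

# by = .EACHI returns one row per foo1 row, in foo1's order
foo1[, new_column := counts$new_column]
foo1
```

On this sample the first interval gets 2, the fourth gets 1, and the rest get 0, so every foo1 row is kept.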
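Answer 1 (score: 0)
Since your original dataset doesn't appear to contain any overlap, I added another row to the example. I used dplyr's mutate to add a column: for each row, between compares that row's Start and End against the whole foo2$Time vector, and the matching foo2$Count values are summed to give the result.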
library(dplyr)
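# add one row that falls inside the first interval; Time is assumed to be POSIXct already
foo2 <- foo2 %>% add_row(Count = 3, Time = as.POSIXct("2017-10-24 22:35:00", tz = "UTC"))
# compare full timestamps (not just dates) so the 10-minute windows are respected
foo1 %>%
  rowwise() %>%
  mutate(Count = sum(foo2$Count[between(foo2$Time, Start, End)]))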
# Source: local data frame [500 x 3]
# Groups: <by row>
#
# A tibble: 500 x 3
# Start End Count
# <dttm> <dttm> <dbl>
# 1 2017-10-24 22:33:59 2017-10-24 22:43:59 3.00
# 2 2017-11-13 06:34:59 2017-11-13 06:44:59 0
# 3 2017-11-13 06:52:00 2017-11-13 07:02:00 0
# 4 2017-11-13 07:16:59 2017-11-13 07:26:59 0
# 5 2017-11-13 07:35:59 2017-11-13 07:45:59 0
# 6 2017-11-13 09:46:00 2017-11-13 09:56:00 0
# 7 2017-11-13 10:46:00 2017-11-13 10:56:00 0
# 8 2017-11-13 11:11:00 2017-11-13 11:21:00 0
# 9 2017-11-13 13:33:00 2017-11-13 13:43:00 0
# 10 2017-11-13 13:50:59 2017-11-13 14:00:59 0
# # ... with 490 more rows
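The rowwise approach above scans all of foo2 once per foo1 row, which may get slow given that the question stresses how much larger foo2 is. As an alternative sketch in base R (not part of the original answer), the counts can be sorted once, turned into a cumulative sum, and looked up per interval with findInterval. The helper name count_in_intervals is hypothetical, and the sample data below is the question's data plus the extra row added above, parsed as UTC (an assumption).

```r
# Hypothetical helper: sum of `count` for all `time` values in [start, end]
count_in_intervals <- function(start, end, time, count) {
  ord <- order(time)
  t_sorted <- as.numeric(time[ord])
  cs <- c(0, cumsum(count[ord]))  # cs[k + 1] = total of the first k sorted counts
  hi <- findInterval(as.numeric(end), t_sorted)                      # number of times <= end
  lo <- findInterval(as.numeric(start), t_sorted, left.open = TRUE)  # number of times <  start
  cs[hi + 1] - cs[lo + 1]
}

foo1 <- data.frame(
  Start = as.POSIXct(c("2017-10-24 22:33:59", "2017-11-13 06:34:59", "2017-11-13 06:52:00",
                       "2017-11-13 07:16:59", "2017-11-13 07:35:59"), tz = "UTC"),
  End   = as.POSIXct(c("2017-10-24 22:43:59", "2017-11-13 06:44:59", "2017-11-13 07:02:00",
                       "2017-11-13 07:26:59", "2017-11-13 07:45:59"), tz = "UTC")
)
foo2 <- data.frame(
  Count = c(1, 1, 1, 1, 3),
  Time  = as.POSIXct(c("2017-10-01 13:45:02", "2017-10-01 12:53:23", "2017-10-01 12:20:56",
                       "2017-10-01 12:31:12", "2017-10-24 22:35:00"), tz = "UTC")
)

foo1$new_column <- count_in_intervals(foo1$Start, foo1$End, foo2$Time, foo2$Count)
foo1
```

This does one sort of foo2 and one binary search per interval boundary instead of a full scan per row. On this sample it gives 3 for the first interval and 0 elsewhere, matching the output above.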