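I'm having trouble working out the logic for this and can't get it done. I couldn't find anything for this specific problem on Stack Overflow or elsewhere on the web.
I have two data frames:
Data frame one: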
ID Date Time
1 2017-11-13 06:34:50
2 2017-11-13 06:40:10
3 2017-11-14 23:58:10
Data frame two:
Number_Visitors hit_time
20 2017-11-13 06:34:50
18 2017-11-13 06:34:50
15 2017-11-15 00:06:10
25 2018-12-14 20:58:10
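What do I want?
I want to match Number_Visitors from table 2 to the date and time in table 1. The hard part: for each row of table 1 I need all visitors whose hit_time falls within a 10-minute window, that is, between the start date/time from table 1 and that start time + 10 minutes.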
ID Date Time End_Time #I don't have this column yet..
1 2017-11-13 06:34:50 06:44:50
2 2017-11-13 06:40:10 06:50:10
3 2017-11-14 23:58:10 00:08:10 #Attention: it is one day later here.
Result:
ID Date Time End_Time Number_of_Visitors_in_range
1 2017-11-13 06:34:50 06:44:50 28
2 2017-11-13 06:40:10 06:50:10 0
3 2017-11-14 23:58:10 00:08:10 15
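Answer 0 (score: 2)
There may be more than one answer to this. Non-equi join / fuzzy join are the search terms.
Based on your example (which was not provided via dput), you can use something like the following. Explanations are in the code comments.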
dplyr / fuzzyjoin:
library(dplyr)
library(lubridate)
library(fuzzyjoin)
# convert hit_time to POSIXct
df2$hit_time <- ymd_hms(df2$hit_time)
# create Start_Time and End_Time so both can be compared with hit_time in df2
df1 <- df1 %>% mutate(Start_Time = ymd_hms(paste(Date, Time)),
                      End_Time = Start_Time + minutes(10))
# fuzzy join on Start_Time <= hit_time <= End_Time, then sum the visitors per window
fuzzy_left_join(df1, df2, c(Start_Time = "hit_time", End_Time = "hit_time"),
                list(`<=`, `>=`)) %>%
  group_by(ID, Start_Time, End_Time) %>%
  summarise(No_Visitors_in_range = sum(Number_Visitors))
# A tibble: 3 x 4
# Groups: ID, Start_Time [?]
ID Start_Time End_Time No_Visitors_in_range
<int> <dttm> <dttm> <int>
1 1 2017-11-13 06:34:50 2017-11-13 06:44:50 38
2 2 2017-11-13 06:40:10 2017-11-13 06:50:10 NA
3 3 2017-11-14 23:58:10 2017-11-15 00:08:10 15
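The expected result in the question shows 0 for ID 2, whereas a left join leaves NA for a window without any hits. A minimal sketch of one way to get 0 instead, assuming the same column names as above and a dplyr version that supports the `.groups` argument: pass `na.rm = TRUE` to `sum()`. The name `result` is only used for illustration.

```r
library(dplyr)
library(lubridate)
library(fuzzyjoin)

# example data rebuilt from the question
df1 <- data.frame(ID = 1:3,
                  Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
                  Time = c("06:34:50", "06:40:10", "23:58:10"))
df2 <- data.frame(Number_Visitors = c(20L, 18L, 15L, 25L),
                  hit_time = ymd_hms(c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
                                       "2017-11-15 00:06:10", "2018-12-14 20:58:10")))

# start and end of each 10-minute window
df1 <- df1 %>%
  mutate(Start_Time = ymd_hms(paste(Date, Time)),
         End_Time = Start_Time + minutes(10))

# windows without a match carry NA after the left join;
# na.rm = TRUE drops it, so the sum of "nothing" becomes 0
result <- fuzzy_left_join(df1, df2,
                          by = c(Start_Time = "hit_time", End_Time = "hit_time"),
                          match_fun = list(`<=`, `>=`)) %>%
  group_by(ID, Start_Time, End_Time) %>%
  summarise(No_Visitors_in_range = sum(Number_Visitors, na.rm = TRUE),
            .groups = "drop")

print(result)
```

Note that `na.rm = TRUE` would also hide genuinely missing visitor counts in df2, so this only fits if Number_Visitors itself has no NA values.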
data.table:
library(data.table)
library(lubridate)
# convert hit_time to POSIXct
df2$hit_time <- ymd_hms(df2$hit_time)
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)
# create Start_Time and End_Time so both can be compared with hit_time in df2
df1[, Start_Time := ymd_hms(paste(Date, Time))][, End_Time := Start_Time + minutes(10)]
# non-equi join: add the sum of Number_Visitors from df2 to df1, one value per row of df1
df1[, No_Visitors_in_range := df2[df1, on = .(hit_time >= Start_Time, hit_time <= End_Time), sum(Number_Visitors), by = .EACHI]$V1]
df1
ID Date Time Start_Time End_Time No_Visitors_in_range
1: 1 2017-11-13 06:34:50 2017-11-13 06:34:50 2017-11-13 06:44:50 38
2: 2 2017-11-13 06:40:10 2017-11-13 06:40:10 2017-11-13 06:50:10 NA
3: 3 2017-11-14 23:58:10 2017-11-14 23:58:10 2017-11-15 00:08:10 15
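If the NA for ID 2 should read 0, as in the question's expected result, one option in data.table is to update the column by reference after the join. This is a sketch under the same column-name assumptions; it uses base `as.POSIXct()` with `tz = "UTC"` instead of lubridate only to keep the example free of extra packages.

```r
library(data.table)

# example data rebuilt from the question
df1 <- data.table(ID = 1:3,
                  Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
                  Time = c("06:34:50", "06:40:10", "23:58:10"))
df2 <- data.table(Number_Visitors = c(20L, 18L, 15L, 25L),
                  hit_time = as.POSIXct(c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
                                          "2017-11-15 00:06:10", "2018-12-14 20:58:10"),
                                        tz = "UTC"))

# start and end of each 10-minute window (600 seconds)
df1[, Start_Time := as.POSIXct(paste(Date, Time), tz = "UTC")]
df1[, End_Time := Start_Time + 10 * 60]

# non-equi join, one sum per row of df1
df1[, No_Visitors_in_range :=
      df2[df1, on = .(hit_time >= Start_Time, hit_time <= End_Time),
          sum(Number_Visitors), by = .EACHI]$V1]

# windows without any hit come back as NA; set them to 0 by reference
df1[is.na(No_Visitors_in_range), No_Visitors_in_range := 0L]

print(df1)
```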
Data:
df1 <- structure(list(ID = 1:3,
                      Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
                      Time = c("06:34:50", "06:40:10", "23:58:10")),
                 class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(Number_Visitors = c(20L, 18L, 15L, 25L),
                      hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
                                   "2017-11-15 00:06:10", "2018-12-14 20:58:10")),
                 class = "data.frame", row.names = c(NA, -4L))
Edit: depending on how the time ranges overlap, it may be better to put the start time first.
I get a warning here, and you might too; it is nothing to worry about, and it is explained here.
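As a dependency-free cross-check of the two joins, the same windowed sum can be sketched in base R. This is not part of the original answer; it assumes all timestamps can be treated as UTC, and the helper vectors `start`, `end` and `hits` are names made up for this illustration. It also builds the End_Time column the question asks for.

```r
# example data rebuilt from the question
df1 <- data.frame(ID = 1:3,
                  Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
                  Time = c("06:34:50", "06:40:10", "23:58:10"))
df2 <- data.frame(Number_Visitors = c(20L, 18L, 15L, 25L),
                  hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
                               "2017-11-15 00:06:10", "2018-12-14 20:58:10"))

# full date-times, so a window that crosses midnight is handled correctly
start <- as.POSIXct(paste(df1$Date, df1$Time), tz = "UTC")
end   <- start + 10 * 60
hits  <- as.POSIXct(df2$hit_time, tz = "UTC")

# End_Time as a clock time only, as in the question's layout
df1$End_Time <- format(end, "%H:%M:%S")

# for each window, sum the visitors whose hit_time lies inside it;
# the sum over an empty selection is 0, so no NA handling is needed
df1$Number_of_Visitors_in_range <- vapply(
  seq_along(start),
  function(i) sum(df2$Number_Visitors[hits >= start[i] & hits <= end[i]]),
  integer(1)
)

print(df1)
```

This loops over every window and scans all hits each time, which is fine for small tables; for large ones the non-equi join approaches are the better fit.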