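I have two data.tables, and I want the rows of xdata whose Time is greater than ydata's StartTime and less than ydata's EndTime.
I tried to write this myself, but I seem to be losing data.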
library(data.table)
xdata=data.table(First=c("X1","X2","X3","X1","X3","X2"),
Second=c("A1","A2","B3","A1","B3","C4"),
Time=c("2018-09-01 09:21:03","2018-10-15 20:24:59","2018-10-15 12:06:46",
"2018-10-16 18:21:11","2018-10-16 21:21:12","2018-10-17 00:00:01"))
ydata=data.table(ID=c("YY","ZZ","AA","HH"),
StartTime=c("2018-08-21 08:00:00","2018-09-01 08:00:00",
"2018-10-15 08:00:00","2018-10-18 08:00:00"),
EndTime=c("2018-08-21 21:20:00","2018-09-01 21:20:00",
"2018-10-15 21:20:00","2018-10-18 21:20:00"))
library(dplyr)
outputXY <- xdata %>% filter(Time > ydata$StartTime & Time < ydata$EndTime)
This outputs only
1 X3 B3 2018-10-15 12:06:46
But what I need is
1 X1 A1 2018-09-01 09:21:03
2 X3 B3 2018-10-15 12:06:46
I tried modifying the code, but the result is the same
outputXY <- xdata[Time > ydata$StartTime & Time < ydata$EndTime]
How can I modify it so that it does what I want?
Answer 0 (score: 2)
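If I understand correctly, the OP wants to find all rows in xdata where Time lies within any of the intervals (StartTime, EndTime) given in ydata.
The inrange() function of the data.table package is built for exactly this purpose. Since the OP asked for open intervals (Time > ydata$StartTime & Time < ydata$EndTime), we need to tell inrange() to exclude the endpoints.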
library(data.table)
# coerce to POSIXct to allow for comparison operations
xdata[, Time := as.POSIXct(Time)]
tcols <- c("StartTime", "EndTime")
ydata[, (tcols) := lapply(.SD, as.POSIXct), .SDcols = tcols]
# subsetting with open intervals
xdata[inrange(Time, ydata$StartTime, ydata$EndTime, incbounds = FALSE)]
   First Second                Time
1:    X1     A1 2018-09-01 09:21:03
2:    X2     A2 2018-10-15 20:24:59
3:    X3     B3 2018-10-15 12:06:46
So three rows of xdata fulfil the condition.
If the OP needs closed intervals instead (Time >= ydata$StartTime & Time <= ydata$EndTime), we can use the inline version of inrange(), the %inrange% operator.
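A minimal sketch of the closed-interval variant, using the sample data from the question. The `tz = "UTC"` argument is an assumption added here so that the result does not depend on the local time zone; the original post does not specify one.

```r
library(data.table)

# Sample data from the question, parsed to POSIXct.
# tz = "UTC" is an assumption made for reproducibility.
xdata <- data.table(
  First  = c("X1", "X2", "X3", "X1", "X3", "X2"),
  Second = c("A1", "A2", "B3", "A1", "B3", "C4"),
  Time   = as.POSIXct(c("2018-09-01 09:21:03", "2018-10-15 20:24:59",
                        "2018-10-15 12:06:46", "2018-10-16 18:21:11",
                        "2018-10-16 21:21:12", "2018-10-17 00:00:01"),
                      tz = "UTC")
)
ydata <- data.table(
  ID        = c("YY", "ZZ", "AA", "HH"),
  StartTime = as.POSIXct(c("2018-08-21 08:00:00", "2018-09-01 08:00:00",
                           "2018-10-15 08:00:00", "2018-10-18 08:00:00"),
                         tz = "UTC"),
  EndTime   = as.POSIXct(c("2018-08-21 21:20:00", "2018-09-01 21:20:00",
                           "2018-10-15 21:20:00", "2018-10-18 21:20:00"),
                         tz = "UTC")
)

# %inrange% takes the lower and upper bounds as a two-column list
# and always includes the endpoints (closed intervals).
closed <- xdata[Time %inrange% ydata[, .(StartTime, EndTime)]]
print(closed)
```

With this particular data no Time falls exactly on a StartTime or EndTime, so the closed and open variants select the same rows.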
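Answer 1 (score: 1)
You need to think about how these two data sets should be combined. My best guess is that you want every xdata time that falls between any pair of ydata start and end times. Your code, however, operates on vectors, so it compares them element by element: the i-th Time is tested only against the i-th StartTime and EndTime.
Let's show how the data line up the way you have written it: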
xdata$Time ydata$StartTime ydata$EndTime
"2018-09-01 09:21:03" "2018-08-21 08:00:00" "2018-08-21 21:20:00"
"2018-10-15 20:24:59" "2018-09-01 08:00:00" "2018-09-01 21:20:00"
"2018-10-15 12:06:46" "2018-10-15 08:00:00" "2018-10-15 21:20:00"
"2018-10-16 18:21:11" "2018-10-18 08:00:00" "2018-10-18 21:20:00"
"2018-10-16 21:21:12" "2018-08-21 08:00:00" "2018-08-21 21:20:00" # recycled
"2018-10-17 00:00:01" "2018-09-01 08:00:00" "2018-09-01 21:20:00" # recycled
Note that when the vector elements are shown side by side like this, the only row that meets the condition is "2018-10-15 12:06:46" "2018-10-15 08:00:00" "2018-10-15 21:20:00"
...
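The alignment described above can be reproduced with plain vectors. This is only an illustrative sketch: the variable names are made up here, and `rep_len()` is used to make explicit the recycling that R otherwise performs silently (with a warning, because 6 is not a multiple of 4).

```r
# Character timestamps from the question. They share one fixed-width ISO
# layout, so comparing them as strings orders them chronologically.
time  <- c("2018-09-01 09:21:03", "2018-10-15 20:24:59", "2018-10-15 12:06:46",
           "2018-10-16 18:21:11", "2018-10-16 21:21:12", "2018-10-17 00:00:01")
start <- c("2018-08-21 08:00:00", "2018-09-01 08:00:00",
           "2018-10-15 08:00:00", "2018-10-18 08:00:00")
end   <- c("2018-08-21 21:20:00", "2018-09-01 21:20:00",
           "2018-10-15 21:20:00", "2018-10-18 21:20:00")

# Make the recycling explicit: the 4 intervals are stretched to length 6,
# so rows 5 and 6 are compared with intervals 1 and 2 again.
start_recycled <- rep_len(start, length(time))
end_recycled   <- rep_len(end, length(time))

# Element-wise test: row i is compared only with (recycled) interval i.
elementwise <- time > start_recycled & time < end_recycled
print(which(elementwise))
```

Only position 3 passes, which is why the original filter returns a single row.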
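One way to do this is to use the CJ function to create a data.table of all combinations of Time and StartTime. We can then check whether each time falls within any of the possible time ranges.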
# Create a table with all combinations of Time and StartTime
timecheck <- CJ(Time = xdata$Time, StartTime = ydata$StartTime)
# Join in the EndTime
timecheck <- merge(timecheck, ydata, by = "StartTime")
# Use vector math to check if the Time is between StartTime and EndTime
# for every combination of possibilities.
timecheck[, in_range := (Time > StartTime & Time < EndTime)]
# Group by Time and summarise whether or not that time is in any range
timecheck <- timecheck[, any(in_range), .(Time)]
# CJ() sorts its output, so the rows of timecheck are not in xdata's order.
# Match on the Time value instead of filtering by row position.
times_in_range <- timecheck[V1 == TRUE, Time]
outputXY <- xdata %>% filter(Time %in% times_in_range)
This will give you the output:
First Second Time
1 X1 A1 2018-09-01 09:21:03
2 X2 A2 2018-10-15 20:24:59
3 X3 B3 2018-10-15 12:06:46
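I suggest running the code one step at a time and looking at what is stored after each intermediate step. There are also other ways to do this with loops, which may use less memory but do not take full advantage of vectorized operations.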
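As a rough illustration of the loop-based alternative mentioned above, the sketch below iterates over the intervals in ydata and accumulates a logical vector, so it never builds the full cross join. It is one possible shape for such a loop, not necessarily the one the answer had in mind; the name `keep` is introduced here for illustration.

```r
library(data.table)

# Sample data from the question (timestamps kept as character strings,
# which compare correctly because they share one fixed-width ISO layout).
xdata <- data.table(
  First  = c("X1", "X2", "X3", "X1", "X3", "X2"),
  Second = c("A1", "A2", "B3", "A1", "B3", "C4"),
  Time   = c("2018-09-01 09:21:03", "2018-10-15 20:24:59", "2018-10-15 12:06:46",
             "2018-10-16 18:21:11", "2018-10-16 21:21:12", "2018-10-17 00:00:01")
)
ydata <- data.table(
  ID        = c("YY", "ZZ", "AA", "HH"),
  StartTime = c("2018-08-21 08:00:00", "2018-09-01 08:00:00",
                "2018-10-15 08:00:00", "2018-10-18 08:00:00"),
  EndTime   = c("2018-08-21 21:20:00", "2018-09-01 21:20:00",
                "2018-10-15 21:20:00", "2018-10-18 21:20:00")
)

# One flag per xdata row: TRUE once the row has matched any interval.
keep <- rep(FALSE, nrow(xdata))
for (i in seq_len(nrow(ydata))) {
  # Each pass is still vectorized over all xdata rows.
  keep <- keep | (xdata$Time > ydata$StartTime[i] & xdata$Time < ydata$EndTime[i])
}

outputXY <- xdata[keep]
print(outputXY)
```

The loop holds only one logical vector of length nrow(xdata) at a time, at the cost of one pass over xdata per interval.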
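Answer 2 (score: 0)
Maybe something like this? It assumes that each time range starts and ends on the same day.
Edit: only dates that are present in ydata are considered.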
library(data.table)
xdata=data.table(First=c("X1","X2","X3","X1","X3","X2"),
Second=c("A1","A2","B3","A1","B3","C4"),
Time=c("2018-09-01 09:21:03","2018-10-15 20:24:59","2018-10-15 12:06:46",
"2018-10-16 18:21:11","2018-10-16 21:21:12","2018-10-17 00:00:01"))
ydata=data.table(ID=c("YY","ZZ","AA","HH"),
StartTime=c("2018-08-21 08:00:00","2018-09-01 08:00:00",
"2018-10-15 08:00:00","2018-10-18 08:00:00"),
EndTime=c("2018-08-21 21:20:00","2018-09-01 21:20:00",
"2018-10-15 21:20:00","2018-10-18 21:20:00"))
xdata[, Date := as.Date(Time)]
ydata[, Date := as.Date(StartTime)]
xdata <- xdata[ydata, on = "Date", nomatch = 0]
outputXY <- xdata[Time > StartTime & Time < EndTime]
outputXY[, c("Date", "StartTime", "EndTime", "ID") := NULL]
print(outputXY)
But the result will be:
First Second Time
1: X1 A1 2018-09-01 09:21:03
2: X2 A2 2018-10-15 20:24:59
3: X3 B3 2018-10-15 12:06:46