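I have two data frames: one, Temperature, contains a year of hourly temperatures (columns Time and Temp), and the other contains flight information. Below is an excerpt of the flight information data frame: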
DepartureTime <- c("2000-01-01 03:01:00","2000-01-01 10:00:00","2000-01-01 14:00:00")
DepartureTime <- as.POSIXct(DepartureTime)
FlightInformation <- data.frame(DepartureTime)
FlightInformation
        DepartureTime
1 2000-01-01 03:01:00
2 2000-01-01 10:00:00
3 2000-01-01 14:00:00
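My goal is to take each row of FlightInformation$DepartureTime and find the closest time in the entire Temperature$Time column. I then want to add the corresponding temperature to the FlightInformation data frame. The desired output should look like this:
FlightInformation
        DepartureTime Temp
1 2000-01-01 03:01:00   20
2 2000-01-01 10:00:00   10
3 2000-01-01 14:00:00   10
So far my attempt looks like this:
i <- 1
while (i <= nrow(Temperature)) {
  j <- 1  # restart the scan of flights for every temperature row
  while (j <= nrow(FlightInformation)) {
    # exact match only, which is why the times are rounded to the hour first
    if (Temperature$Time[i] == FlightInformation$Time[j]) {
      FlightInformation$Temp[j] <- Temperature$Temp[i]  # assign, not compare
    }
    j <- j + 1
  }
  i <- i + 1
}
This involves first rounding all times to the nearest hour. The approach is not as accurate as I would like and seems very inefficient! Is there a simple way to find the nearest POSIXct time that gives the output I want?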
Answer 0 (score: 1)
Some assumptions:
NA
We'll use cut, which finds the interval in a series of breaks that each value falls into:
(ind <- cut(FlightInformation$DepartureTime, Temperature$Time, labels = FALSE))
# [1] 1 2 NA
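These are the rows of Temperature from which we should retrieve $Temp. Unfortunately, cut always returns the row at the lower end of the interval, even when the departure time is closer to the next value, so we compensate: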
(ind <- ind + (abs(Temperature$Time[ind] - FlightInformation$DepartureTime) >
abs(Temperature$Time[1+ind] - FlightInformation$DepartureTime)))
# [1] 1 3 NA
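OK, now the NA: it means that the latest $DepartureTime is beyond the known times. This indicates a violation of my first assumption above, but it can be fixed. I use a magic constant of "6 hours" here to decide whether the data is close enough to be usable; there are certainly many other heuristics, none of them wrong. For those rows, we can assume the latest temperature: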
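# difftime(..., units = "secs") keeps the comparison with 60*60*6 (6 hours in seconds) in matching units
(is_recoverable <- is.na(ind) & abs(difftime(FlightInformation$DepartureTime, max(Temperature$Time), units = "secs")) < 60*60*6)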
# [1] FALSE FALSE TRUE
ind[is_recoverable] <- nrow(Temperature)
ind
# [1] 1 3 3
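One caveat when comparing a time difference against 60*60*6: subtracting POSIXct values yields a difftime whose units R picks automatically, so the number is only in seconds if you ask for seconds. A minimal sketch with made-up times (dep and last are illustrative names, not part of the question's data):

```r
# Made-up times for illustration only
dep  <- as.POSIXct(c("2000-01-01 10:00:00", "2000-01-01 18:00:00"), tz = "UTC")
last <- as.POSIXct("2000-01-01 10:53:00", tz = "UTC")

# Plain subtraction: R chooses the units from the smallest difference
d_auto <- abs(dep - last)
units(d_auto)                       # "mins" for these values

# Explicit units: safe to compare with a number of seconds
d_secs <- abs(difftime(dep, last, units = "secs"))
as.numeric(d_secs)                  # 3180 25620

six_hours <- 60 * 60 * 6
within_auto <- as.vector(d_auto < six_hours)  # compares minutes with seconds
within_secs <- as.vector(d_secs < six_hours)  # compares seconds with seconds
```

Here the second flight is just over 7 hours past the last reading, so only the version with explicit units flags it as outside the 6-hour window.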
The result:
FlightInformation$Temp <- Temperature$Temp[ ind ]
FlightInformation
#         DepartureTime Temp
# 1 2000-01-01 03:01:00   20
# 2 2000-01-01 10:00:00   10
# 3 2000-01-01 14:00:00   10
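While this is definitely faster than the double while loop, it runs into problems if there are large gaps in the temperature data. That is, if your data has a 3-year gap, the nearest temperature will still be used, and it might be from 2.99 years ago. To double-check, use: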
FlightInformation$TempTime <- Temperature$Time[ ind ]
FlightInformation$TimeDelta <- with(FlightInformation, abs(TempTime - DepartureTime))
FlightInformation
#         DepartureTime Temp            TempTime TimeDelta
# 1 2000-01-01 03:01:00   20 2000-01-01 00:53:00  128 mins
# 2 2000-01-01 10:00:00   10 2000-01-01 10:53:00   53 mins
# 3 2000-01-01 14:00:00   10 2000-01-01 10:53:00  187 mins
You can switch the time delta to different units and check for problems with:
units(FlightInformation$TimeDelta) <- "secs"
which(FlightInformation$TimeDelta > 60*60*6)
# integer(0)
(where integer(0) means that nothing falls outside my 6-hour magic window.)
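The steps above (locate the interval, move to the next row when it is closer, and reject matches that are too far away) can be folded into one helper. The sketch below is not the code from this answer: it swaps cut for base R's findInterval, and the names nearest_index and max_gap_secs, as well as the 05:53 temperature row, are made up for illustration. It assumes Temperature$Time is sorted in ascending order and contains no NA.

```r
# Hypothetical helper: for each element of x, the index of the closest element of ref.
# Assumes ref is sorted ascending and has no NA values.
nearest_index <- function(x, ref, max_gap_secs = Inf) {
  x   <- as.numeric(x)                 # POSIXct -> seconds since the epoch
  ref <- as.numeric(ref)
  lo  <- findInterval(x, ref)          # 0 when x lies before ref[1]
  lo  <- pmax(lo, 1L)                  # clamp so that ref[lo] always exists
  hi  <- pmin(lo + 1L, length(ref))    # neighbour on the other side (or lo itself at the end)
  best <- ifelse(abs(ref[hi] - x) < abs(x - ref[lo]), hi, lo)
  best[abs(ref[best] - x) > max_gap_secs] <- NA_integer_  # too far away to trust
  best
}

# Made-up temperature data; the 05:53 row is an assumption for illustration
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 05:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 15, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)

ind <- nearest_index(FlightInformation$DepartureTime, Temperature$Time,
                     max_gap_secs = 60 * 60 * 6)
FlightInformation$Temp <- Temperature$Temp[ind]
```

Working on plain seconds avoids the difftime unit question entirely, and a departure outside the tolerance gets NA instead of a stale temperature.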
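Answer 1 (score: 1)
Here's one way! Times are easiest to work with once they are converted to numeric values. You can then compare the numeric values to find the nearest times before and after a reference time (FlightInformation$time_num in the example below). Once you have the closest times before and after the reference value, work out which of the two is closer to it. Use that time value to look up (index) the correct temperature and add it to your data frame.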
# convert time to numeric (seconds since origin of time)
Temperature$time_num <- as.numeric(Temperature$Time)
FlightInformation$time_num <- as.numeric(FlightInformation$DepartureTime)
# make sure time data is in correct order so that indexes for time are in correct order
Temperature <- Temperature[with(Temperature, order(time_num)), ] # sort data
for (i in seq_len(nrow(FlightInformation))) { # for each row of data in flight...
  # find the time in Temp that is closest + prior to Flight time
  # create a logical vector saying which Temperature$time_num are <= to FlightInformation$time_num.
  # pull the max row index from the logical vector where value == TRUE (this is the closest time for Temp that is prior to Flight Time)
  # use that row index to look up the Temperature$time_num value that is closest + prior to Flight time
  # will return NA/warning message if no time in Temp is before time in Flight
  temptime_prior <- Temperature[max(which(Temperature$time_num <= FlightInformation$time_num[i])), "time_num"]
  # find the time in Temp that is closest + after to Flight time
  # will return NA/warning message if no time in Temp is after time in Flight
  temptime_after <- Temperature[min(which(Temperature$time_num > FlightInformation$time_num[i])), "time_num"]
  # compare times before and after to see which is closest to flight time. If no before/after time was found (e.g., NA was returned), always use the other time value
  temptime_closest <- ifelse(is.na(temptime_prior), temptime_after,
                      ifelse(is.na(temptime_after), temptime_prior,
                      ifelse((FlightInformation$time_num[i] - temptime_prior) <= (temptime_after - FlightInformation$time_num[i]),
                             temptime_prior, temptime_after)))
  # look up the right temp by finding the row index of right Temp$time_num value and add it to Flight info
  FlightInformation$Temp[i] <- Temperature[which(Temperature$time_num == temptime_closest), "Temp"]
}
# get rid of numeric time column, you don't need it anymore
FlightInformation <- FlightInformation[, !(names(FlightInformation) %in% c("time_num"))]
Output:
        DepartureTime Temp
1 2000-01-01 03:01:00   20
2 2000-01-01 10:00:00   10
3 2000-01-01 14:00:00   10
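For small data, the same before/after comparison can be collapsed into a single which.min call per flight. This is only a sketch of the idea, not a replacement for the loop above: it compares every flight with every temperature row, so it scales poorly for a full year of data, and the 05:53 temperature row is made up for illustration.

```r
# Made-up temperature data; the 05:53 row is an assumption for illustration
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 05:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 15, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)

temp_num   <- as.numeric(Temperature$Time)
flight_num <- as.numeric(FlightInformation$DepartureTime)

# For each flight, the row of Temperature with the smallest absolute time difference
closest_row <- vapply(flight_num,
                      function(t) which.min(abs(temp_num - t)),
                      integer(1))
FlightInformation$Temp <- Temperature$Temp[closest_row]
```

which.min returns the first minimum, so an exact tie goes to the earlier reading when Temperature is sorted by time, which matches the <= in the loop above.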
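If each data frame contains subsets that need to be matched separately (e.g., matching df1$group1 time values only against df2$group1 time values), you can use survival::neardate. It's a nice function that does essentially what the code above does, with a few extra arguments if you need them.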
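As a rough sketch of how that could look: neardate(id1, id2, y1, y2, best) returns, for each row of the first data set, the index of the matching row in the second data set with the same id, looking either forward (best = "after") or backward (best = "prior"). Getting the closest row in either direction therefore takes two calls. The Station column and all values below are made up for illustration, and the code only runs if the survival package is installed.

```r
# Made-up data: Station is a hypothetical grouping column
Temperature <- data.frame(
  Station = c(1, 1, 2, 2),
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 10:53:00",
                      "2000-01-01 00:53:00", "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 10, 25, 15)
)
FlightInformation <- data.frame(
  Station = c(1, 2),
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00"), tz = "UTC")
)

if (requireNamespace("survival", quietly = TRUE)) {
  t_temp <- as.numeric(Temperature$Time)               # seconds, so both sides share a class
  t_dep  <- as.numeric(FlightInformation$DepartureTime)

  # Closest reading at or after, and at or before, each departure within the same station
  after <- survival::neardate(FlightInformation$Station, Temperature$Station,
                              t_dep, t_temp, best = "after")
  prior <- survival::neardate(FlightInformation$Station, Temperature$Station,
                              t_dep, t_temp, best = "prior")

  # Keep whichever side is closer; fall back to the other side when one is missing
  closest <- ifelse(is.na(after), prior,
             ifelse(is.na(prior), after,
             ifelse(t_dep - t_temp[prior] <= t_temp[after] - t_dep, prior, after)))
  FlightInformation$Temp <- Temperature$Temp[closest]
}
```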
Hope this helps! The code is much shorter without all the comments =)