R: Combine two data frames by nearest time

Time: 2018-04-07 00:40:38

Tags: r

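I have two data frames: one contains a year of hourly temperatures, the other contains flight information. The temperature data frame, Temperature, has a Time column and a Temp column. Below is an excerpt of the flight information data frame:

  DepartureTime <- c("2000-01-01 03:01:00","2000-01-01 10:00:00","2000-01-01 14:00:00")
  DepartureTime <- as.POSIXct(DepartureTime)
  FlightInformation <- data.frame(DepartureTime)
  FlightInformation
        DepartureTime
1 2000-01-01 03:01:00
2 2000-01-01 10:00:00
3 2000-01-01 14:00:00

My goal is to take each row of FlightInformation$DepartureTime and find the closest time in the whole Temperature$Time column. I then want to add the corresponding temperature to the FlightInformation data frame. The desired output should look like this:

FlightInformation
        DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10

My attempt so far:

  # Assumes both Time columns already hold times rounded to the nearest hour
  i <- 1
  while(i <= nrow(Temperature)){
    j <- 1  # restart the inner scan for every temperature row
    while(j <= nrow(FlightInformation)){
      if(Temperature$Time[i] == FlightInformation$Time[j]){
        FlightInformation$Temp[j] <- Temperature$Temp[i]  # assign (<-), not compare (==)
      }
      j <- j + 1
    }
    i <- i + 1
  }

This involves first rounding all times to the nearest hour. The approach is not as accurate as I would like, and it looks very inefficient. Is there a simple way to find the nearest POSIX time that gives the output I want?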

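2 answers:

Answer 0: (score: 1)

Some assumptions:

  • You have temperature data both before and after all of the flight information; otherwise you will see NA
  • The temperature data is continuous enough for the nearest reading to be meaningful, so you will not pick up a reading from 3 months earlier (useless)
  • The temperature data is ordered (if not, that is easy to fix)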
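
The three assumptions above can be checked before any matching is done. The following is only a minimal sketch in base R. The Temperature values, in particular the 05:53 reading and its temperature, are hypothetical sample data made up for illustration, and the names outside and max_gap_hours are introduced here.

```r
# Hypothetical sample data (the Temperature values are assumed for illustration)
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 05:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 30, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)

# Assumption 3: sort the temperature data if it is not already ordered
if (is.unsorted(as.numeric(Temperature$Time))) {
  Temperature <- Temperature[order(Temperature$Time), ]
}

# Assumption 1: flag flights that fall outside the range of temperature times
outside <- FlightInformation$DepartureTime < min(Temperature$Time) |
  FlightInformation$DepartureTime > max(Temperature$Time)

# Assumption 2: largest gap between consecutive readings, in hours
max_gap_hours <- max(diff(as.numeric(Temperature$Time))) / 3600
```

With this sample data only the last flight is flagged in outside, which is the row that later shows up as NA.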

We will use cut, which finds the interval in a series of breaks into which each value falls:

(ind <- cut(FlightInformation$DepartureTime, Temperature$Time, labels = FALSE))
# [1]  1  2 NA

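These are the rows in Temperature from which we should retrieve $Temp. Unfortunately, cut always returns the start of the interval and does not account for a value being closer to the next break, so we compensate: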

(ind <- ind + (abs(Temperature$Time[ind] - FlightInformation$DepartureTime) >
                 abs(Temperature$Time[1+ind] - FlightInformation$DepartureTime)))
# [1]  1  3 NA

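Now for the NA: it indicates that the latest $DepartureTime is beyond the known times. This violates my first assumption above, but it can be fixed. Here I use a magic constant of 6 hours to decide whether the data is close enough to be usable; there are certainly many other heuristics, none of them wrong. For those rows, we can assume the latest temperature: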

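# difftime(..., units = "secs") pins the units so the comparison with 60*60*6 seconds is valid
(is_recoverable <- is.na(ind) & abs(difftime(FlightInformation$DepartureTime, max(Temperature$Time), units = "secs")) < 60*60*6)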
# [1] FALSE FALSE  TRUE
ind[is_recoverable] <- nrow(Temperature)
ind
# [1] 1 3 3

The result:

FlightInformation$Temp <- Temperature$Temp[ ind ]
FlightInformation
#         DepartureTime Temp
# 1 2000-01-01 03:01:00   20
# 2 2000-01-01 10:00:00   10
# 3 2000-01-01 14:00:00   10
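
As a cross-check on the steps above, here is a sketch that collects the lookup into one helper. It swaps cut() for findInterval(), so it is not the same code as above, and it does not apply the 6-hour window. The helper name nearest_index and the Temperature sample values are hypothetical.

```r
# Hypothetical sample data (the Temperature values are assumed for illustration)
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 05:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 30, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)

# Hypothetical helper: for each x, the index of the nearest element of ref.
# ref must be sorted in ascending order.
nearest_index <- function(x, ref) {
  x <- as.numeric(x)
  ref <- as.numeric(ref)
  lo <- findInterval(x, ref)            # index of last ref <= x (0 if none)
  before <- pmax(lo, 1L)                # clamp to the first row
  after <- pmin(lo + 1L, length(ref))   # clamp to the last row
  # ties go to the earlier reading
  ifelse(abs(x - ref[before]) <= abs(ref[after] - x), before, after)
}

ind <- nearest_index(FlightInformation$DepartureTime, Temperature$Time)
FlightInformation$Temp <- Temperature$Temp[ind]
```

Because the indices are clamped to the first and last rows, departures outside the temperature range silently take the nearest end reading, so the gap check remains worthwhile.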

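While this is definitely faster than the double while loop, it runs into trouble if the temperature data has large gaps. That is, if your data has a 3-year gap, the nearest temperature will still be used, and it could be from 2.99 years earlier. To double-check, use: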

FlightInformation$TempTime <- Temperature$Time[ ind ]
FlightInformation$TimeDelta <- with(FlightInformation, abs(TempTime - DepartureTime))
FlightInformation
#         DepartureTime Temp            TempTime TimeDelta
# 1 2000-01-01 03:01:00   20 2000-01-01 00:53:00  128 mins
# 2 2000-01-01 10:00:00   10 2000-01-01 10:53:00   53 mins
# 3 2000-01-01 14:00:00   10 2000-01-01 10:53:00  187 mins

You can use different units for the time delta and check for problems with:

units(FlightInformation$TimeDelta) <- "secs"
which(FlightInformation$TimeDelta > 60*60*6)
# integer(0)

(where integer(0) means nothing falls outside my 6-hour magic window.)
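
If some rows do fall outside the window, one option is to blank out their temperature instead of keeping a stale value. This is only a sketch under assumptions: the helper mask_stale is hypothetical, the Temperature sample values are made up for illustration, and ind is hard-coded to the indices found above.

```r
# Hypothetical sample data (the Temperature values are assumed for illustration)
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 05:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 30, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)
ind <- c(1L, 3L, 3L)  # row indices as found by the matching step

# Hypothetical helper: replace temperatures whose reading is too far away with NA
mask_stale <- function(temp, delta_secs, max_secs) {
  ifelse(delta_secs > max_secs, NA, temp)
}

# Absolute distance between each departure and its matched reading, in seconds
delta_secs <- abs(as.numeric(difftime(Temperature$Time[ind],
                                      FlightInformation$DepartureTime,
                                      units = "secs")))

temp_6h <- mask_stale(Temperature$Temp[ind], delta_secs, 60 * 60 * 6)  # nothing masked
temp_3h <- mask_stale(Temperature$Temp[ind], delta_secs, 60 * 60 * 3)  # third flight masked
```

With the 6-hour window all three temperatures survive. Tightening it to 3 hours masks the third flight, whose matched reading is 187 minutes away.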

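Answer 1: (score: 1)

Here is one way! Times are easiest to work with once they are converted to numeric values. You can then compare the numeric values to find the nearest times before and after a reference time (FlightInformation$time_num in the example below). Once you have the nearest time before and the nearest time after the reference value, work out which of the two is closer to it. Use that time value to look up (index) the right temperature and add it to the data frame.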

#convert time to numeric (seconds since origin of time)
Temperature$time_num <- as.numeric(Temperature$Time) 
FlightInformation$time_num <- as.numeric(FlightInformation$DepartureTime)

#make sure time data is in correct order so that indexes for time are in correct order 
Temperature <- Temperature[with(Temperature, order(time_num)), ] #sort data

for (i in 1:nrow(FlightInformation)) #for each row of data in flight...
{
  #find the time in Temp that is closest + prior to Flight time
  #create a logical vector saying which Temperature$time_num are <= to FlightInformation$time_num. 
  #pull the max row index from the logical vector where value == TRUE (this is the closest time for Temp that is prior to Flight Time)
  #use that row index to look up the Temperature$time_num value that is closest + prior to Flight time
  #will return NA/warning message if no time in Temp is before time in Flight
  temptime_prior <- Temperature[max(which(Temperature$time_num <= FlightInformation$time_num[i])), "time_num"] 

  #find the time in Temp that is closest + after to Flight time
  #will return NA/warning message if no time in Temp is after time in Flight
  temptime_after <- Temperature[min(which(Temperature$time_num > FlightInformation$time_num[i])), "time_num"] 

  #compare times before and after to see which is closest to flight time. If no before/after time was found (e.g., NA was returned), always use the other time value
  temptime_closest <- ifelse(is.na(temptime_prior), temptime_after, 
                             ifelse(is.na(temptime_after), temptime_prior, 
                                    ifelse((FlightInformation$time_num[i] - temptime_prior) <= (temptime_after - FlightInformation$time_num[i]),
                                           temptime_prior, temptime_after)))

  #look up the right temp by finding the row index of right Temp$time_num value and add it to Flight info
  FlightInformation$Temp[i] <- Temperature[which(Temperature$time_num == temptime_closest), "Temp"]
}

#get rid of numeric time column, you don't need it anymore
FlightInformation <- FlightInformation[,!(names(FlightInformation) %in% c("time_num"))]

Output

        DepartureTime Temp
1 2000-01-01 03:01:00   20
2 2000-01-01 10:00:00   10
3 2000-01-01 14:00:00   10
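
For comparison, the same nearest-time lookup can be sketched without the explicit before/after bookkeeping by taking the minimum absolute difference for each flight. This is a brute-force variant, not the loop above: it compares every flight against every temperature row, so it suits small data. The Temperature sample values and the name nearest_row are hypothetical.

```r
# Hypothetical sample data (the Temperature values are assumed for illustration)
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 05:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 30, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)

# Convert both time columns to seconds since the epoch
flight_num <- as.numeric(FlightInformation$DepartureTime)
temp_num <- as.numeric(Temperature$Time)

# For each flight, the row of Temperature with the smallest absolute time difference
nearest_row <- vapply(flight_num,
                      function(t) which.min(abs(temp_num - t)),
                      integer(1))

FlightInformation$Temp <- Temperature$Temp[nearest_row]
```

which.min returns the first minimum, so when Temperature is sorted by time a tie goes to the earlier reading, the same tie rule as the <= comparison in the loop.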

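If each data frame contains subsets of data that need to be matched within groups (for example, matching df1$group1 time values only to df2$group1 time values), you can use survival::neardate. It is a nice function that does essentially what the code above does, with some extra arguments if you need them.

Hope this helps! The code is much shorter without all the comments =)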