R - 如何按最近的时间日期连接两个数据帧?

时间:2016-09-02 00:39:29

标签: r datetime

我有2个数据集,每个数据集包含POSIXlt格式的日期时间值,以及一些其他数字和字符变量。

我想基于日期时间列合并两个数据集。 但是两个数据集的日期戳都不匹配,所以我需要按最近的日期(之前或之后)组合它们。 在我的例子中,2016-03-01 23:52:00的数据值“e”需要在2016-03-02 00:00:00与“binH”合并,而不是“binG”。

是否有一个函数可以让我按最接近的日期时间值组合我的数据集,即使它在之后?

我找到了使用cut()函数或data.tables中的roll = Inf函数将日期组合到下一个上一个日期的方法。但我无法将时间戳记录为任何格式roll ='nearest'接受。

    >df1
    date1 value
    1 2016-03-01 17:52:00     a
    2 2016-03-01 18:01:30     b
    3 2016-03-01 18:05:00     c
    4 2016-03-01 20:42:30     d
    5 2016-03-01 23:52:00     e

    >df2
    date2 bin_name
    1 2016-03-01 17:00:00     binA
    2 2016-03-01 18:00:00     binB
    3 2016-03-01 19:00:00     binC
    4 2016-03-01 20:00:00     binD
    5 2016-03-01 21:00:00     binE
    6 2016-03-01 22:00:00     binF
    7 2016-03-01 23:00:00     binG
    8 2016-03-02 00:00:00     binH
    9 2016-03-02 01:00:00     binI

2 个答案:

答案 0 :(得分:4)

data.table应该可以解决这个问题(你能解释一下你遇到的错误吗?),虽然它确实倾向于将POSIXlt自己转换为POSIXct(可能会手动将你的datetime列转换为让data.table感到高兴。另外,请确保在使用roll之前设置了键列。

(我在这里创建了自己的示例表,让我的生活变得更轻松。如果你想在你的上面使用dput,我很乐意用你的数据更新这个例子):

new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )

                  date data.new
1: 2016-03-02 12:20:00        t
2: 2016-03-07 12:20:00        u

old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )


                  date data.old
1: 2016-03-02 12:20:00        a
2: 2016-03-07 12:20:00        b

setkey( new, date )
setkey( old, date )

combined <- new[ old, roll = "nearest" ]
combined

                  date data.new data.old
1: 2015-03-02 12:20:00        t        d
2: 2016-03-02 12:20:00        t        a
3: 2016-03-07 12:20:00        u        b
4: 2016-04-02 12:20:00        v        c

我故意使两个表的行长不同,以显示滚动连接如何处理多个匹配。您可以切换它加入的方式:

combined <- old[ new, roll = "nearest" ]
combined

                  date data.old data.new
1: 2016-03-02 12:20:00        a        t
2: 2016-03-07 12:20:00        b        u
3: 2016-04-02 12:20:00        c        v

答案 1 :(得分:0)

我遇到了类似的问题,但是我没有使用data.tabletidyverse来创建自己的函数amerge来实现“近似合并”。它包含4个参数:

  • 两个数据帧,
  • “ firm”(非近似)合并的列名称向量-这些必须存在于两个数据帧中,
  • 以及用于近似合并的单个列的名称(在两个数据框中)。它适用于任何数字值,包括日期。

这个想法是合并最佳匹配的第1到第1行,而不是从任何数据框中删除任何行。这是我的注释代码和一个有效示例。

amerge <- function(d1, d2, firm=NULL, approx=NULL) {
  rt = Sys.time()

  # Take care of conflicting column names
  n2 = data.frame(oldname = names(d2), newname = names(d2))
  n2$newname = as.character(n2$newname)
  n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)] =
    paste(n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)], "2", sep=".")

  # Add unique row IDs
  if (length(c(firm, approx))>1) {
    d1$ID1 = factor(apply(d1[,c(approx,firm)], 1, paste, collapse=" "))
    d2$ID2 = factor(apply(d2[,c(approx,firm)], 1, paste, collapse=" "))
  } else {
    d1$ID1 = factor(d1[,c(approx,firm)])
    d2$ID2 = factor(d2[,c(approx,firm)])
  }

  # Perform initial merge on the 'firm' parameters, if any
  # Otherwise match all to all
  if (length(firm)>0) {
    t1 = merge(d1, d2, by=firm, all=T, suff=c("",".2"))
  } else {
    names(d2)= c(n2$newname,"ID2")
    t1 = data.frame()
    for (i1 in 1:nrow(d1)) {
      trow = d1[i1,]
      t1 = rbind(t1, cbind(trow, d2))
    }
  }

  # Match by the most approximate record
  if (length(approx)==1) {
    # Calculate the differential for approximate merging
    t1$DIFF = abs(t1[,approx] - t1[,n2$newname[n2$oldname==approx]])
    # Sort data by ascending DIFF, so that best matching records are used first
    t1 = t1[order(t1$DIFF, t1$ID1, t1$ID2),]
    t2 = data.frame()
    d2$used = 0
    # For each record of d1, find match from d2
    for (i1 in na.omit(unique(t1$ID1))) {
      tx = t1[!is.na(t1$DIFF) & t1$ID1==i1,]
      # If there are non-missing records, get the one with minimum DIFF (top one)
      if (nrow(tx)>0) {
        tx = tx[1,]
        # If matching record found, remove it from the pool, so it's not used again
        t1[!is.na(t1$ID2) & t1$ID2==tx$ID2, c(n2$newname[!(n2$newname %in% firm)], "DIFF")] = NA
        # And mark it as used
        d2$used[d2$ID2==tx$ID2] = 1
      } else {
        # If there are no non-missing records, just get the first one from the top
        tx = t1[!is.na(t1$ID1) & t1$ID1==i1,][1,]
      }
      t2 = rbind(t2,tx)
    }
  } else {
    t2 = t1
  }
  # Make the records the same order as d1
  t2 = t2[match(d1$ID1, t2$ID1),]
  # Add unmatched records from d2 to the end of output
  if (any(d2$used==0)) {
    tx = t1[t1$ID2 %in% d2$ID2[d2$used==0], ]
    tx = tx[!duplicated(tx$ID2),]
    tx[, names(d1)[!(names(d1) %in% c(firm))]] = NA
    t2 = rbind(t2,tx)
    t2[is.na(t2[,approx]), approx] = t2[is.na(t2[,approx]), n2$newname[n2$oldname==approx]]
  }
  t2$DIFF = t2$ID1 = t2$ID2 = NULL
  cat("* Run time: ", round(difftime(Sys.time(),rt, "secs"),1), " seconds.\n", sep="")
  return(t2)
}

和示例:

new <- data.frame(ID=c(1,1,1,2), date = as.POSIXct( c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-04-12 11:03:00")), new = c("t","u","v","x"))
old <- data.frame(ID=c(1,1,1,1,1), date = as.POSIXct( c("2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-03-01 10:09:00", "2015-04-12 10:09:00","2016-03-03 12:20:00")), old = c("a","b","c","d","e"))

amerge(old, new, firm="ID", approx="date")

它输出:

   ID                date  old              date.2  new
2   1 2016-03-07 12:20:00    a 2016-03-07 12:20:00    u
6   1 2016-04-02 12:20:00    b 2016-04-02 12:20:00    v
7   1 2016-03-01 10:09:00    c                <NA> <NA>
10  1 2015-04-12 10:09:00    d                <NA> <NA>
13  1 2016-03-03 12:20:00    e 2016-03-02 12:20:00    t
16  2 2016-04-12 11:03:00 <NA> 2016-04-12 11:03:00    x

因此可以按我的预期目的工作-两个数据帧中的每一行都有一个副本-匹配最短的时间差。注意事项:该函数将date.2复制到date丢失的date列中。