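sqldf query with date differences

Asked: 2016-09-06 21:55:14

Tags: r date merge difference sqldf

I am trying to run a (simplified!) query on color and date difference against the following two data sets (excerpt):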

A                           B   
    A.COL   A.TIME              B.COL   B.TIME
1   blue    2009-01-31      1   blue    2007-01-31
2   blue    2009-02-28      2   blue    2008-12-31
3   blue    2009-03-31      3   blue    2009-02-28
4   blue    2009-04-30      4   blue    2009-04-30
5   blue    2009-05-31      5   blue    2009-06-30
6   blue    2009-06-30      6   blue    2016-08-31
7   blue    2016-03-31
8   blue    2016-04-30
9   red ...
10  red ...

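What I want to do: merge the tables on COL and on the difference in TIME, i.e. the two dates must be no more than 2 months apart in either direction (in other words, the difference must lie between -2 and +2, depending on which date you start from).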

# For example starting with observation 1 from A, that would imply 2 matches:
2009-01-31 matched to 2008-12-31 (diff = 1)
2009-01-31 matched to 2009-02-28  (diff = -1)

# for obs 2 from A, that would imply 
2009-02-28 matched to 2008-12-31 (diff = 2)
2009-02-28 matched to 2009-02-28 (diff = 0)
2009-02-28 matched to 2009-04-30 (diff = -2)

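And so on. I was thinking of some kind of date-difference function, either from lubridate, which runs into problems when a month has fewer than 30 days and sometimes yields NAs, or as.yearmon from zoo, which at least computes the difference correctly. However, I could not get this to work inside sqldf (Error: error in statement: near "as": syntax error). The reason seems to be that not every R function can be used within sqldf.

Any ideas how this can be done in R? I am also looking for an elegant way to subtract months from each other. lubridate has this problem: Add/subtract 6 months (bond time) in R using lubridate, but a proposed approach is described here: Get the difference between dates in terms of weeks, months, quarters, and years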
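The syntax error above is consistent with sqldf handing the statement to its SQL backend (SQLite by default), where R functions such as as.yearmon are not available. One possible workaround, sketched here under several assumptions, is to compute the month difference with SQLite's own strftime(). The table names tab_a and tab_b and the underscore column names are hypothetical stand-ins for A and B (chosen to avoid quoting dotted names), and the dates are assumed to be stored as 'YYYY-MM-DD' text rather than as Date objects.

```r
library(sqldf)

# Hypothetical stand-ins for A and B; dates kept as 'YYYY-MM-DD' strings
tab_a <- data.frame(a_col = "blue",
                    a_time = c("2009-01-31", "2009-02-28"),
                    stringsAsFactors = FALSE)
tab_b <- data.frame(b_col = "blue",
                    b_time = c("2007-01-31", "2008-12-31",
                               "2009-02-28", "2009-04-30"),
                    stringsAsFactors = FALSE)

# Month number = year * 12 + month, computed inside SQLite;
# diff is the A month number minus the B month number
res <- sqldf("
  SELECT * FROM (
    SELECT x.a_col, x.a_time, y.b_time,
           (CAST(strftime('%Y', x.a_time) AS INTEGER) * 12
              + CAST(strftime('%m', x.a_time) AS INTEGER))
         - (CAST(strftime('%Y', y.b_time) AS INTEGER) * 12
              + CAST(strftime('%m', y.b_time) AS INTEGER)) AS diff
    FROM tab_a x
    JOIN tab_b y ON x.a_col = y.b_col
  )
  WHERE diff BETWEEN -2 AND 2
  ORDER BY a_time, b_time
")
print(res)
```

On this small sample the query should return the five matches listed above for observations 1 and 2, with the same signs for diff. It counts calendar months only and ignores the day of the month.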

To get the data, see the data-loading code at the end of the answer below (thanks to @bouncyball).

1 Answer:

Answer 0 (score: 1)

Here is a solution that uses the functions from this SO post together with the plyr package:

library(plyr)

# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) { 
  lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
  lt$year*12 + lt$mon 
  } 

# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }

# iterate over rows of A looking for matches in B
adply(A, 1, function(x)
  B[x$A.COL == B$B.COL & 
      abs(mondf(as.Date(x$A.TIME), as.Date(B$B.TIME))) <= 2,]
)

#     A.COL    A.TIME  B.COL    B.TIME
# 1   blue 2009-01-31  blue 2008-12-31
# 2   blue 2009-01-31  blue 2009-02-28
# 3   blue 2009-02-28  blue 2008-12-31
# 4   blue 2009-02-28  blue 2009-02-28
# 5   blue 2009-02-28  blue 2009-04-30
#  ....
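To make the sign convention of the helpers concrete, here is a small worked check in base R. It repeats the two function definitions from above so that it runs on its own; the sample dates are taken from the question, and the variable names a_time, b_time and d are only illustrative.

```r
# Helper definitions repeated from the answer so this block is self-contained
monnb <- function(d) {
  lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
  lt$year * 12 + lt$mon   # years since 1900 times 12, plus zero-based month
}
mondf <- function(d1, d2) monnb(d2) - monnb(d1)

a_time <- as.Date("2009-02-28")
b_time <- as.Date(c("2008-12-31", "2009-02-28", "2009-04-30", "2009-06-30"))

# mondf() returns d2 minus d1, i.e. B minus A here, in whole calendar months
d <- mondf(a_time, b_time)
print(d)            # -2  0  2  4
print(abs(d) <= 2)  # TRUE TRUE TRUE FALSE
```

Note that mondf(A, B) has the opposite sign to the diff values in the question (which are A minus B). Because the filter uses abs(), the set of matches is the same either way.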

Edit: data.table implementation

library(data.table)
merge_AB <- data.table(merge(A,B, by.x = 'A.COL', by.y = 'B.COL'))

merge_AB[,DateDiff := abs(mondf(A.TIME, B.TIME))
       ][DateDiff <= 2]

 #     A.COL     A.TIME     B.TIME DateDiff
 # 1:  blue 2009-01-31 2008-12-31        1
 # 2:  blue 2009-01-31 2009-02-28        1
 # 3:  blue 2009-02-28 2008-12-31        2
 # 4:  blue 2009-02-28 2009-02-28        0
 # 5:  blue 2009-02-28 2009-04-30        2
 # ...

Data

A <- read.table(
text = "
A.COL   A.TIME          
blue    2009-01-31     
blue    2009-02-28      
blue    2009-03-31      
blue    2009-04-30      
blue    2009-05-31      
blue    2009-06-30
blue    2016-03-31
blue    2016-04-30
", header = TRUE, stringsAsFactors = FALSE)


B <- read.table(
  text = "
B.COL   B.TIME
blue    2007-01-31
blue    2008-12-31
blue    2009-02-28
blue    2009-04-30
blue    2009-06-30
blue    2016-08-31
", stringsAsFactors = FALSE, header = TRUE)