I am trying to run a (simplified!) query on colour and date differences over the following two databases (excerpts):
A                             B
    A.COL  A.TIME                 B.COL  B.TIME
 1  blue   2009-01-31          1  blue   2007-01-31
 2  blue   2009-02-28          2  blue   2008-12-31
 3  blue   2009-03-31          3  blue   2009-02-28
 4  blue   2009-04-30          4  blue   2009-04-30
 5  blue   2009-05-31          5  blue   2009-06-30
 6  blue   2009-06-30          6  blue   2016-08-31
 7  blue   2016-03-31
 8  blue   2016-04-30
 9  red    ...
10  red    ...
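What I want to do: merge the tables on COL and on the difference in TIME, i.e. the two dates must be no more than 2 months apart in either direction (in other words, the difference lies between -2 and +2, depending on which date you start from).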
# For example starting with observation 1 from A, that would imply 2 matches:
2009-01-31 matched to 2008-12-31 (diff = 1)
2009-01-31 matched to 2009-02-28 (diff = -1)
# for obs 2 from A, that would imply
2009-02-28 matched to 2008-12-31 (diff = 2)
2009-02-28 matched to 2009-02-28 (diff = 0)
2009-02-28 matched to 2009-04-30 (diff = -2)
etc.
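The matching rule above can be sketched in base R. This is a minimal illustration, not the asker's code: `month_index` is a hypothetical helper that counts calendar months and ignores the day of the month, and the sign convention (A minus B) is chosen to agree with the examples listed above.

```r
# Hypothetical helper: calendar months since year 0, ignoring the day of month
month_index <- function(d) {
  d <- as.Date(d)
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

# Observation 2 from A against a few candidate dates from B
a <- "2009-02-28"
b <- c("2008-12-31", "2009-02-28", "2009-04-30", "2009-06-30")

diffs <- month_index(a) - month_index(b)
diffs            # 2, 0, -2, -4

# Keep only candidates that are at most 2 months away in either direction
keep <- abs(diffs) <= 2
b[keep]          # the first three dates
```

The last candidate (2009-06-30) is four months away and is dropped, which agrees with the three matches listed for observation 2.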
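I was thinking of some kind of date difference function, either from lubridate, which causes problems when months have fewer than 30 days and sometimes produces NAs, or as.yearmon from zoo, which at least calculates the differences correctly. However, I could not get this to work inside sqldf (Error in statement: near "as": syntax error). The reason seems to be that not every R function can be used within sqldf.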
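To illustrate why a day-based difference is awkward for this rule, here is a small base R sketch (an illustration only; it does not use lubridate or zoo, and the variable names are made up). It takes the pair 2009-02-28 / 2009-04-30 from the examples above, which should count as exactly two months apart.

```r
a <- as.Date("2009-02-28")
b <- as.Date("2009-04-30")

# Difference in days, then a naive conversion to "months" of 30 days
days_apart <- as.numeric(b - a)
days_apart        # 61
days_apart / 30   # slightly more than 2, so a "<= 2" filter would drop this pair

# Calendar-month difference: compare year and month only, ignore the day
cal_months <- (as.integer(format(b, "%Y")) - as.integer(format(a, "%Y"))) * 12L +
  (as.integer(format(b, "%m")) - as.integer(format(a, "%m")))
cal_months        # 2
```

Counting calendar months instead of days keeps end-of-month dates such as these within the two-month window.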
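Any ideas on how this could be done in R? I am also looking for an elegant way to subtract months from each other. For lubridate there is this issue:
Add/subtract 6 months (bond time) in R using lubridate, but here is a proposed way to do it using zoo:
Get the difference between dates in terms of weeks, months, quarters, and years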
Getting the data: see the read.table calls at the end of the answer below (thanks to @bouncyball for the code).
Answer 0 (score: 1)
Here is a solution using functions from this SO post and the plyr package:
library(plyr)

# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) {
  lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
  lt$year * 12 + lt$mon
}
# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }

# iterate over rows of A looking for matches in B
adply(A, 1, function(x)
  B[x$A.COL == B$B.COL &
      abs(mondf(as.Date(x$A.TIME), as.Date(B$B.TIME))) <= 2, ]
)
# A.COL A.TIME B.COL B.TIME
# 1 blue 2009-01-31 blue 2008-12-31
# 2 blue 2009-01-31 blue 2009-02-28
# 3 blue 2009-02-28 blue 2008-12-31
# 4 blue 2009-02-28 blue 2009-02-28
# 5 blue 2009-02-28 blue 2009-04-30
# ....
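As a quick check of the sign convention, the sketch below re-declares the two helpers (without the `origin` argument, which should only matter for numeric input) and applies them to the first observation from A. Note that `mondf(d1, d2)` computes `d2` minus `d1`, so the signs come out opposite to the ones listed in the question; because the filter takes `abs()`, this does not change which rows match.

```r
# Same idea as the helpers above, re-declared so this block runs on its own
monnb <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  lt$year * 12 + lt$mon        # lt$year counts from 1900, lt$mon from 0
}
mondf <- function(d1, d2) monnb(d2) - monnb(d1)

monnb("2009-01-31")                 # 109 * 12 + 0 = 1308
mondf("2009-01-31", "2008-12-31")   # -1 (the question lists this pair as diff = 1)
mondf("2009-01-31", "2009-02-28")   #  1 (the question lists this pair as diff = -1)
```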
A data.table implementation (reusing mondf from above):
library(data.table)
merge_AB <- data.table(merge(A,B, by.x = 'A.COL', by.y = 'B.COL'))
merge_AB[,DateDiff := abs(mondf(A.TIME, B.TIME))
][DateDiff <= 2]
# A.COL A.TIME B.TIME DateDiff
# 1: blue 2009-01-31 2008-12-31 1
# 2: blue 2009-01-31 2009-02-28 1
# 3: blue 2009-02-28 2008-12-31 2
# 4: blue 2009-02-28 2009-02-28 0
# 5: blue 2009-02-28 2009-04-30 2
# ...
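If the signed difference is needed (as in the question, where A minus B ranges from -2 to +2), a base R variant could look like the sketch below. It is an illustration under stated assumptions rather than part of the original answer: `month_index` is a hypothetical helper, and the data frames are rebuilt inline from the sample data so that the block runs on its own.

```r
# Sample data, rebuilt inline (same values as the tables in the question)
A <- data.frame(
  A.COL  = "blue",
  A.TIME = c("2009-01-31", "2009-02-28", "2009-03-31", "2009-04-30",
             "2009-05-31", "2009-06-30", "2016-03-31", "2016-04-30"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  B.COL  = "blue",
  B.TIME = c("2007-01-31", "2008-12-31", "2009-02-28",
             "2009-04-30", "2009-06-30", "2016-08-31"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: calendar month number, ignoring the day of month
month_index <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  lt$year * 12 + lt$mon
}

# All A/B combinations with the same colour, then a signed month difference
pairs <- merge(A, B, by.x = "A.COL", by.y = "B.COL")
pairs$diff <- month_index(pairs$A.TIME) - month_index(pairs$B.TIME)

# Keep pairs whose dates are within two months of each other
matched <- pairs[pairs$diff >= -2 & pairs$diff <= 2, ]
nrow(matched)   # 14 for this sample data
```

Like the data.table version, this builds every same-colour combination before filtering, so it is only practical while the tables stay reasonably small.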
Data:
A <- read.table(
text = "
A.COL A.TIME
blue 2009-01-31
blue 2009-02-28
blue 2009-03-31
blue 2009-04-30
blue 2009-05-31
blue 2009-06-30
blue 2016-03-31
blue 2016-04-30
", header = TRUE, stringsAsFactors = FALSE)
B <- read.table(
text = "
B.COL B.TIME
blue 2007-01-31
blue 2008-12-31
blue 2009-02-28
blue 2009-04-30
blue 2009-06-30
blue 2016-08-31
", stringsAsFactors = FALSE, header = TRUE)