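I'm very new to R and have been wondering whether there is a function or package for approximate (dateTime) matching. The intersect() function returns the list of exact matches, but I'm interested in approximate matches.
E.g. I have two arrays of dateTime values, and I'd like a list of the values that occur in both arrays, allowing a maximum difference of 2 seconds.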
arrayA<-c("2000-12-31 10:00:00","2000-12-31 12:00:00")
arrayB<-c("2000-12-31 10:00:00","2000-12-31 12:00:01")
arrayA<-strptime(arrayA, "%Y-%m-%d %H:%M:%S", tz="UTC")
arrayB<-strptime(arrayB, "%Y-%m-%d %H:%M:%S", tz="UTC")
intersect(arrayA,arrayB) #returns "2000-12-31 10:00:00 UTC"
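intersect() only returns values that are exactly identical, but I'd like to get back both "2000-12-31 10:00:00 UTC" and "2000-12-31 12:00:00 UTC".
So basically my question is whether you can specify how closely values must agree for intersect to count them as a match. My question concerns dates, but numeric values would probably run into the same problem. My data set is quite large, so matching manually with 2 for loops tends to take a very long time, whereas intersect is very fast.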
Answer 0 (score: 2)
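The data.table package offers two approaches: the foverlaps() function and a non-equi join. Both approaches require adding helper columns to the data.
arrayA <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                             "2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
arrayB <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                             "2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")
Note that both vectors are of class POSIXct, which is more suitable than the POSIXlt class created by the strptime() function. In addition, more timestamps have been added to test non-matches.
Data preparation is the same for both approaches: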
# make data.tables
library(data.table) # version 1.10.4 used here
A <- data.table(arrayA)
B <- data.table(arrayB)
# define tolerance = 2 * tol_half
tol_half <- 1L # seconds
# add helper columns
A[, "copyA" := arrayA]
A
# arrayA copyA
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:00
#3: 2000-12-31 12:00:05 2000-12-31 12:00:05
#4: 2000-12-31 12:00:10 2000-12-31 12:00:10
B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]
B
# arrayB start end
#1: 2000-12-31 10:00:00 2000-12-31 09:59:59 2000-12-31 10:00:01
#2: 2000-12-31 12:00:01 2000-12-31 12:00:00 2000-12-31 12:00:02
#3: 2000-12-31 12:00:02 2000-12-31 12:00:01 2000-12-31 12:00:03
#4: 2000-12-31 11:00:00 2000-12-31 10:59:59 2000-12-31 11:00:01
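start and end in B denote the tolerable time range within which arrayA must fall to be considered a match. This is similar to what the match_fun function does on the fly in the fuzzyjoin solution.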
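The window test described above can be sketched in base R without any packages. This is only an illustration under stated assumptions: the name in_window is hypothetical, timestamps are converted to numeric seconds, and the full comparison matrix needs memory proportional to length(arrayA) * length(arrayB), so it is suitable for small inputs only.

```r
# Base R sketch of the tolerance-window test (illustrative, not the
# data.table implementation); timestamps as numeric seconds since the epoch
arrayA <- as.numeric(as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                                  "2000-12-31 12:00:05", "2000-12-31 12:00:10"),
                                tz = "UTC"))
arrayB <- as.numeric(as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                                  "2000-12-31 12:00:02", "2000-12-31 11:00:00"),
                                tz = "UTC"))
tol_half <- 1  # seconds, same value as in the data preparation

# window bounds around each value of arrayB
start <- arrayB - tol_half
end   <- arrayB + tol_half

# in_window[i, j] is TRUE when arrayA[i] lies inside [start[j], end[j]]
in_window <- outer(arrayA, start, ">=") & outer(arrayA, end, "<=")

# row = index into arrayA, col = index into arrayB
which(in_window, arr.ind = TRUE)
```

With the sample data, only the pairs (1, 1) and (2, 2) fall inside a window, which agrees with the two rows returned by the joins in this answer.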
Using foverlaps()
foverlaps() searches for overlapping time ranges in A and B:
# setting keys is required by foverlaps()
setkey(A, arrayA, copyA)
setkey(B, start, end)
# find overlaps
result <- foverlaps(B, A, nomatch = 0)[, c("copyA", "start", "end") := NULL][]
result
# arrayA arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01
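Note that [, c("copyA", "start", "end") := NULL][] immediately removes the helper columns from the output of foverlaps().
With recent versions of data.table, non-equi joins are possible. Note that, thanks to automatic indexing, a non-equi join does not require keys to be set beforehand:
result <- A[B, .(arrayA, arrayB), on = c("copyA>=start", "copyA<=end"), nomatch = 0L]
result
# arrayA arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01
To do: it would be interesting to compare foverlaps(), fuzzyjoin, and the non-equi join on a large use case.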
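For large inputs, a sorted lookup is another candidate for such a comparison. The following is a minimal base R sketch, not one of the methods shown above: near_any is a hypothetical helper, it assumes numeric input (convert POSIXct with as.numeric() first) and a non-empty y, and it only reports whether each x has a neighbour within tol, not which one.

```r
# Hypothetical helper: TRUE for each x that has at least one value of y
# within `tol` (x, y and tol in the same units, e.g. seconds)
near_any <- function(x, y, tol) {
  y <- sort(y)
  # index of the largest y that is <= x (0 if there is none)
  lo <- findInterval(x, y)
  # the two candidate neighbours: nearest y below/equal and nearest y above
  below <- y[pmax(lo, 1L)]
  above <- y[pmin(lo + 1L, length(y))]
  pmin(abs(x - below), abs(x - above)) <= tol
}

a <- as.numeric(as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                             "2000-12-31 12:00:05", "2000-12-31 12:00:10"),
                           tz = "UTC"))
b <- as.numeric(as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                             "2000-12-31 12:00:02", "2000-12-31 11:00:00"),
                           tz = "UTC"))

near_any(a, b, tol = 1)
# [1]  TRUE  TRUE FALSE FALSE
```

Because y is sorted once and each x is located by binary search, this avoids the nested for loops mentioned in the question. Setting tol = 2 corresponds to the 2-second maximum difference asked for there.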
Answer 1 (score: 0)
library(lubridate)
library(fuzzyjoin)
arrayA<-c("2000-12-31 10:00:00","2000-12-31 12:00:00")
arrayB<-c("2000-12-31 10:00:00","2000-12-31 12:00:01")
arrayA <- strptime(arrayA, "%Y-%m-%d %H:%M:%S", tz = "UTC")
arrayB <- strptime(arrayB, "%Y-%m-%d %H:%M:%S", tz = "UTC")
# make data frames for join operations
A <- as.data.frame(arrayA)
B <- as.data.frame(arrayB)
# fuzzyjoin works by matching rows where a function applied
# to the column pairs is TRUE. Here the function is defined
# inline, and uses lubridate durations.
fuzzy_join(A, B,
           by = c("arrayA" = "arrayB"),
           match_fun = function(x, y) {abs(x - y) <= duration(2, "seconds")})
# arrayA arrayB
# 1 2000-12-31 10:00:00 2000-12-31 10:00:00
# 2 2000-12-31 12:00:00 2000-12-31 12:00:01
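The matching predicate can also be tried on its own, outside the join. A minimal base R sketch, assuming a hypothetical helper named within_secs (not part of fuzzyjoin or lubridate) that uses difftime() with explicit units instead of lubridate durations:

```r
# Hypothetical stand-alone predicate: TRUE where x and y differ by at most
# `tol` seconds; units are fixed to seconds rather than chosen automatically
within_secs <- function(x, y, tol = 2) {
  abs(as.numeric(difftime(x, y, units = "secs"))) <= tol
}

x <- as.POSIXct("2000-12-31 12:00:00", tz = "UTC")
y <- as.POSIXct(c("2000-12-31 12:00:01", "2000-12-31 12:00:05"), tz = "UTC")

within_secs(x, y)
# [1]  TRUE FALSE
```

Fixing the units explicitly matters because subtracting two date-times directly lets R pick the units of the result (seconds, minutes, hours, ...) based on the size of the difference.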