Question

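I am new to R and have been wondering whether there is a function or package for approximate (dateTime) matching. The function intersect() returns the list of exact matches, but I am interested in approximate matches.

For example, I have two arrays of dateTime values and I would like a list of the values that occur in both arrays, allowing a maximum difference of 2 seconds.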

arrayA<-c("2000-12-31 10:00:00","2000-12-31 12:00:00")
arrayB<-c("2000-12-31 10:00:00","2000-12-31 12:00:01")
arrayA<-strptime(arrayA, "%Y-%m-%d %H:%M:%S", tz="UTC")
arrayB<-strptime(arrayB, "%Y-%m-%d %H:%M:%S", tz="UTC")

intersect(arrayA,arrayB) #returns "2000-12-31 10:00:00 UTC"

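intersect() returns only the value that is exactly equal, but I would like both "2000-12-31 10:00:00 UTC" and "2000-12-31 12:00:00 UTC" to be returned.

So basically my question is whether a tolerance can be specified for the matching that intersect() performs. My question is about dates, but the same problem could arise with numeric values. My data set is very large, so manual matching with two for loops tends to take a long time, whereas intersect() is very fast.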

Answer 1

The data.table package offers two approaches: the foverlaps() function and non-equi joins. Both approaches require helper columns to be added to the data.

Create data

arrayA <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                             "2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
arrayB <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                             "2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")

Note that both vectors are of class POSIXct, which is more suitable than the POSIXlt class created by the strptime() function. In addition, more timestamps have been added to test for non-matches.
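To illustrate the difference between the two classes, here is a minimal base R sketch that parses the same strings both ways. The variable names lt and ct are only illustrative and do not appear in the answer.

```r
x <- c("2000-12-31 10:00:00", "2000-12-31 12:00:01")

# strptime() returns POSIXlt, which is stored as a list of date-time components
lt <- strptime(x, "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(lt)
typeof(lt)   # "list"

# as.POSIXct() returns POSIXct, stored as seconds since 1970-01-01 UTC
ct <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(ct)
typeof(ct)   # "double"

# Because POSIXct is a plain numeric vector underneath, time differences
# in seconds are simple arithmetic
as.numeric(ct[2]) - as.numeric(ct[1])   # 7201
```

A single numeric value per timestamp is also what makes POSIXct fit naturally into a data.table column that can be keyed or joined on.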

Prepare data

The data preparation is the same for both approaches:

# make data.tables
library(data.table)   # version 1.10.4 used here
A <- data.table(arrayA)
B <- data.table(arrayB)

# define tolerance = 2 * tol_half
tol_half <- 1L # seconds

# add helper columns
A[, "copyA" := arrayA]
A
#                arrayA               copyA
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:00
#3: 2000-12-31 12:00:05 2000-12-31 12:00:05
#4: 2000-12-31 12:00:10 2000-12-31 12:00:10

B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]
B
#                arrayB               start                 end
#1: 2000-12-31 10:00:00 2000-12-31 09:59:59 2000-12-31 10:00:01
#2: 2000-12-31 12:00:01 2000-12-31 12:00:00 2000-12-31 12:00:02
#3: 2000-12-31 12:00:02 2000-12-31 12:00:01 2000-12-31 12:00:03
#4: 2000-12-31 11:00:00 2000-12-31 10:59:59 2000-12-31 11:00:01

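In B, start and end denote the tolerated time range into which arrayA must fall to be considered a match. This is similar to what the match_fun function does on the fly in the fuzzyjoin solution.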
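The range test that the helper columns encode can be spelled out in base R. The sketch below is only an illustration of the logic, not the method used in this answer: it builds a full comparison matrix with outer(), so its memory use grows with the product of the two lengths and it is not suited to large data. The names a, b, b_start, b_end and inside are made up for this example.

```r
a <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                  "2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
b <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                  "2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")

tol_half <- 1                       # seconds, same value as in the answer
b_start <- as.numeric(b) - tol_half
b_end   <- as.numeric(b) + tol_half

# inside[i, j] is TRUE when a[i] lies within [b_start[j], b_end[j]]
inside <- outer(as.numeric(a), b_start, ">=") &
          outer(as.numeric(a), b_end, "<=")

# turn the TRUE cells into (row, col) index pairs and look up the timestamps
idx <- which(inside, arr.ind = TRUE)
matches <- data.frame(arrayA = a[idx[, "row"]], arrayB = b[idx[, "col"]])
matches
```

With these inputs, two pairs fall inside the range: the two 10:00:00 values, and 12:00:00 paired with 12:00:01.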

foverlaps()

foverlaps() is used to search for overlapping time ranges in A and B:

# setting keys is required by foverlaps()
setkey(A, arrayA, copyA)
setkey(B, start, end)
# find overlaps
result <- foverlaps(B, A, nomatch = 0)[, c("copyA", "start", "end") := NULL][]
result
#                arrayA              arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01

Note that [, c("copyA", "start", "end") := NULL][] immediately removes the helper columns from the output of foverlaps().
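The foverlaps() steps can be wrapped into a small function that takes the tolerance as an argument. This is a sketch under assumptions: approx_intersect() and its tol argument are hypothetical names, not part of data.table, and tol is a one-sided tolerance in seconds (a match means the two timestamps differ by at most tol). Both inputs are assumed to be POSIXct vectors.

```r
library(data.table)

# Hypothetical helper: pairs of timestamps from a and b that differ by at most
# tol seconds. Not part of data.table.
approx_intersect <- function(a, b, tol = 2) {
  A <- data.table(arrayA = a, copyA = a)                      # zero-width interval
  B <- data.table(arrayB = b, start = b - tol, end = b + tol) # tolerance window
  setkey(A, arrayA, copyA)        # foverlaps() needs a key on the second table
  res <- foverlaps(B, A, by.x = c("start", "end"), nomatch = 0L)
  res[, .(arrayA, arrayB)]        # keep only the original columns
}

# The two vectors from the question
qa <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00"), tz = "UTC")
qb <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01"), tz = "UTC")
approx_intersect(qa, qb, tol = 2)
```

Note that the answer's tol_half <- 1L corresponds to tol = 1 here. With the 2-second tolerance from the question, a value in the first vector can pair with more than one value in the second, so unique() on the arrayA column gives an intersect()-like result.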

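Non-equi joins

With recent versions of data.table, non-equi joins are possible:

result <- A[B, .(arrayA, arrayB), on = c("copyA>=start", "copyA<=end"), nomatch = 0L]
result
#                arrayA              arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01

Note that non-equi joins do not require keys to be set beforehand, thanks to automatic indexing.

Benchmark

To do: it would be interesting to compare foverlaps(), fuzzyjoin, and non-equi joins on a large use case.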
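One possible starting point for such a comparison is sketched below. It covers only the two data.table approaches, uses randomly generated timestamps, and reports no results, since timings depend on machine, data size, and package version. The size n, the seed, and the one-day range are arbitrary assumptions chosen for illustration.

```r
library(data.table)

set.seed(42)                        # arbitrary seed, for reproducibility only
n <- 1e4                            # assumed size; increase for a real benchmark
origin <- as.POSIXct("2000-12-31 00:00:00", tz = "UTC")
tol_half <- 1                       # seconds

# random timestamps within one day
a <- origin + sample.int(86400L, n, replace = TRUE)
b <- origin + sample.int(86400L, n, replace = TRUE)
A <- data.table(arrayA = a, copyA = a)
B <- data.table(arrayB = b, start = b - tol_half, end = b + tol_half)

# non-equi join (no keys needed)
t_nonequi <- system.time(
  r1 <- A[B, .(arrayA, arrayB), on = c("copyA>=start", "copyA<=end"), nomatch = 0L]
)

# foverlaps(), including the time spent setting keys
t_foverlaps <- system.time({
  setkey(A, arrayA, copyA)
  setkey(B, start, end)
  r2 <- foverlaps(B, A, nomatch = 0L)[, .(arrayA, arrayB)]
})

# both approaches should find the same number of matching pairs
nrow(r1) == nrow(r2)
c(nonequi = t_nonequi[["elapsed"]], foverlaps = t_foverlaps[["elapsed"]])
```

For a more careful comparison, each approach should be run several times, and the fuzzyjoin solution could be added in the same way.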

Answer 2

library(lubridate)
library(fuzzyjoin)
arrayA<-c("2000-12-31 10:00:00","2000-12-31 12:00:00")
arrayB<-c("2000-12-31 10:00:00","2000-12-31 12:00:01")
arrayA <- strptime(arrayA, "%Y-%m-%d %H:%M:%S", tz = "UTC")
arrayB <- strptime(arrayB, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# make data frames for join operations
A <- as.data.frame(arrayA)
B <- as.data.frame(arrayB)

# fuzzyjoin works by matching rows where a function applied
# to the column pairs is TRUE. Here the function is defined 
# inline, and uses lubridate durations.
fuzzy_join(A, B, 
           by=c("arrayA" = "arrayB"), 
           match_fun = function(x,y) {abs(x-y) <= duration(2, "seconds")})

# arrayA              arrayB
# 1 2000-12-31 10:00:00 2000-12-31 10:00:00
# 2 2000-12-31 12:00:00 2000-12-31 12:00:01
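The matching function can also be tried on its own, outside the join. The following base R sketch is an alternative formulation, not the code from this answer: it assumes the predicate receives two equal-length date-time vectors and must return a logical vector, and it fixes the units with difftime() instead of relying on lubridate durations. The name within_tol is hypothetical.

```r
# Hypothetical predicate: TRUE where x and y differ by at most tol seconds.
# units = "secs" is set explicitly, because a plain x - y chooses its units
# (secs, mins, hours, ...) automatically based on the size of the differences.
within_tol <- function(x, y, tol = 2) {
  abs(as.numeric(difftime(x, y, units = "secs"))) <= tol
}

x <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00"), tz = "UTC")
y <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01"), tz = "UTC")

within_tol(x, y)        # TRUE TRUE: differences of 0 and 1 second
within_tol(x, rev(y))   # FALSE FALSE: differences of about two hours
```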
