Question

I am trying to find synchronized data entries, i.e. entries that share a certain value ("ref") across a certain number of timestamps.

Dummy data:

library(data.table)

dft <- data.table(
  id = rep(1:5, each=5),
  time = rep(1:5, 5),
  ref = c(10,11,11,11,11,
          10,11,11,11,21,
          20,31,31,31,31,
          20,41,41,41,41,
          20,51,51,51,51)
)

setorder(dft, time)
dft[, time := as.POSIXct(time, origin = "2018-10-14")]
dft

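In this example, IDs 1 and 2 are synchronized on 4 timestamps (rows 1, 2, 6, 7, 11, 12, 16 and 17) because they share the same ref values; those rows are marked with "!" in the output below. Note that they share one ref value at the first timestamp and a different shared ref value at the following timestamps.

How can I solve this? I would also like to define the number of timestamps on which the values must be the same. If I require at least 5 synchronized timestamps, no IDs from this example should be returned. If the threshold is 4 or lower, IDs 1 & 2 should show up as synchronized data entries.

I have to run this calculation on several million rows, so I would prefer a data.table or dplyr solution, or any other high-performance solution (SQL would also be fine).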

    id                time ref
 1:  1 2018-10-14 02:00:01  10    !
 2:  2 2018-10-14 02:00:01  10    !
 3:  3 2018-10-14 02:00:01  20
 4:  4 2018-10-14 02:00:01  20
 5:  5 2018-10-14 02:00:01  20
 6:  1 2018-10-14 02:00:02  11    !
 7:  2 2018-10-14 02:00:02  11    !
 8:  3 2018-10-14 02:00:02  31
 9:  4 2018-10-14 02:00:02  41
10:  5 2018-10-14 02:00:02  51
11:  1 2018-10-14 02:00:03  11    !
12:  2 2018-10-14 02:00:03  11    !
13:  3 2018-10-14 02:00:03  31
14:  4 2018-10-14 02:00:03  41
15:  5 2018-10-14 02:00:03  51
16:  1 2018-10-14 02:00:04  11    !
17:  2 2018-10-14 02:00:04  11    !
18:  3 2018-10-14 02:00:04  31
19:  4 2018-10-14 02:00:04  41
20:  5 2018-10-14 02:00:04  51
21:  1 2018-10-14 02:00:05  11
22:  2 2018-10-14 02:00:05  21
23:  3 2018-10-14 02:00:05  31
24:  4 2018-10-14 02:00:05  41
25:  5 2018-10-14 02:00:05  51

Benchmarking the two examples from @DavidArenburg:

library(microbenchmark)

mc = microbenchmark(times = 100,
  res1 = dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref), nomatch = 0L, allow.cartesian=TRUE][, .N, by = .(id, id2)],
  res2= dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian=TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
)

mc

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 res1 156.8389 158.8122 165.1828 159.6931 165.9156 292.7987   100  a 
 res2 311.1658 324.5684 350.3006 331.4310 343.6755 815.8397   100   b

Answer 1

A possible data.table solution

dft[dft, .(id, id2 = x.id),       # get the desired columns
    on = .(id > id, time, ref),   # the join condition
    nomatch = 0L,                 # remove unmatched records (NAs)
    allow.cartesian = TRUE        # In case of a big join, allow Cartesian join
    ][, .N, by = .(id, id2)]      # Count obs. per ids combinations

#    id id2 N
# 1:  1   2 4
# 2:  3   4 1
# 3:  3   5 1
# 4:  4   5 1

Explanation

We perform a self join on time and ref while also specifying id > id, so that an ID is never joined to itself and each pair is matched only once. We extract the joined IDs (id and x.id are the IDs coming from the two data sets) and remove all unmatched rows (nomatch = 0L). Finally, we count the matched combinations (.N is a special symbol in data.table that stores the number of observations per group).
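The question also asks for a configurable minimum number of synchronized timestamps. The following is a minimal sketch of one way to add that, not part of the original answer. It assumes the dummy data from the question, rebuilt here with time left as plain integers so the block is self-contained, and it introduces a hypothetical threshold variable min_sync:

```r
library(data.table)

# Rebuild the dummy data from the question (time kept as integers here)
dft <- data.table(
  id   = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref  = c(10, 11, 11, 11, 11,
           10, 11, 11, 11, 21,
           20, 31, 31, 31, 31,
           20, 41, 41, 41, 41,
           20, 51, 51, 51, 51)
)

# Hypothetical threshold: minimum number of shared timestamps
min_sync <- 4L

# Count shared (time, ref) rows per pair of IDs with the self join
pairs <- dft[dft, .(id, id2 = x.id),
             on = .(id > id, time, ref),
             nomatch = 0L,
             allow.cartesian = TRUE][, .N, by = .(id, id2)]

# Keep only the pairs that reach the threshold
synced <- pairs[N >= min_sync]
synced
```

With min_sync set to 5L the filtered result is empty, which matches the behaviour requested in the question.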

Old (and more involved) solution

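dft[dft, .(pmin(id, i.id), pmax(id, i.id)),
    on = .(time, ref),
    allow.cartesian = TRUE
    ][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]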
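The second expression benchmarked in the question (res2) joins only on time and ref, so every pair is matched in both directions and each ID also matches itself. pmin/pmax put each pair into a fixed order, V1 != V2 drops the self matches, and dividing .N by 2 undoes the double counting. The sketch below is an illustrative check, assuming the integer-time version of the dummy data, that both approaches report the same counts; the name cmp is introduced here only for the comparison:

```r
library(data.table)

# Dummy data from the question (time kept as integers here)
dft <- data.table(
  id   = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref  = c(10, 11, 11, 11, 11,
           10, 11, 11, 11, 21,
           20, 31, 31, 31, 31,
           20, 41, 41, 41, 41,
           20, 51, 51, 51, 51)
)

# Non-equi self join: each pair of IDs is matched once
res1 <- dft[dft, .(id, id2 = x.id),
            on = .(id > id, time, ref),
            nomatch = 0L,
            allow.cartesian = TRUE][, .N, by = .(id, id2)]

# Equi self join: each pair is matched twice, plus self matches
res2 <- dft[dft, .(pmin(id, i.id), pmax(id, i.id)),
            on = .(time, ref),
            allow.cartesian = TRUE][V1 != V2, .(synced = .N / 2L),
                                    by = .(id1 = V1, id2 = V2)]

# Line the two results up by ID pair and compare the counts
cmp <- merge(res1, res2, by.x = c("id", "id2"), by.y = c("id1", "id2"))
cmp[, all(N == synced)]
```

The non-equi version avoids building the duplicated and self-matched rows in the first place, which is consistent with it being the faster of the two in the benchmark shown in the question.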

Answer 2

Translating @David Arenburg's code to SQL gives me:

SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id 
GROUP BY a.id, b.id
ORDER BY count(*) DESC;

And to select only those with a count > 1:

SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id
GROUP BY a.id, b.id
HAVING count(*) > 1
ORDER BY count(*) DESC;

Code to produce the SQL table from the data frame in the question (dft):

R:

fwrite(x = dft, file = "C:/testdata.csv", row.names = FALSE)

SQL:

CREATE TABLE testdata (
  id serial NOT NULL,
  timest timestamp,
  ref integer
);

-- HEADER skips the column-name row that fwrite writes by default
COPY testdata(id, timest, ref)
FROM 'C:/testdata.csv'
DELIMITER ',' CSV HEADER;

How to find synchronized IDs based on timestamp and value
