How to find synced IDs based on timestamps and values

Time: 2018-08-09 16:00:25

Tags: sql r dplyr data.table

I am trying to find synced data entries, i.e. entries that share a certain value ("ref") over a certain number of timestamps.

Dummy data:

library(data.table)

dft <- data.table(
  id = rep(1:5, each=5),
  time = rep(1:5, 5),
  ref = c(10,11,11,11,11,
          10,11,11,11,21,
          20,31,31,31,31,
          20,41,41,41,41,
          20,51,51,51,51)
)

setorder(dft, time)
dft[, time := as.POSIXct(time, origin = "2018-10-14")]
dft

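In this example, IDs 1 and 2 are synced over 4 timestamps, in rows 1, 2, 6, 7, 11, 12, 16 and 17, because they share the same ref value (the rows are marked with a !). Note: they share one ref value at the first timestamp and a different ref value at the following timestamps.

How can I solve this? I would also like to define the number of timestamps at which the values must be identical. If I require at least 5 synced timestamps, no IDs should be returned for this example. If I set the threshold to 4 or lower, IDs 1 & 2 should show up as synced data entries.

I have to run this calculation over a few million rows, so I would prefer a data.table or dplyr solution, or any other high-performance solution (SQL would be fine too).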

    id                time ref
 1:  1 2018-10-14 02:00:01  10    !
 2:  2 2018-10-14 02:00:01  10    !
 3:  3 2018-10-14 02:00:01  20
 4:  4 2018-10-14 02:00:01  20
 5:  5 2018-10-14 02:00:01  20
 6:  1 2018-10-14 02:00:02  11    !
 7:  2 2018-10-14 02:00:02  11    !
 8:  3 2018-10-14 02:00:02  31
 9:  4 2018-10-14 02:00:02  41
10:  5 2018-10-14 02:00:02  51
11:  1 2018-10-14 02:00:03  11    !
12:  2 2018-10-14 02:00:03  11    !
13:  3 2018-10-14 02:00:03  31
14:  4 2018-10-14 02:00:03  41
15:  5 2018-10-14 02:00:03  51
16:  1 2018-10-14 02:00:04  11    !
17:  2 2018-10-14 02:00:04  11    !
18:  3 2018-10-14 02:00:04  31
19:  4 2018-10-14 02:00:04  41
20:  5 2018-10-14 02:00:04  51
21:  1 2018-10-14 02:00:05  11
22:  2 2018-10-14 02:00:05  21
23:  3 2018-10-14 02:00:05  31
24:  4 2018-10-14 02:00:05  41
25:  5 2018-10-14 02:00:05  51

Benchmarking the two examples from @DavidArenburg:

library(microbenchmark)

mc = microbenchmark(times = 100,
  res1 = dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref), nomatch = 0L, allow.cartesian=TRUE][, .N, by = .(id, id2)],
  res2= dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian=TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
)

mc
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 res1 156.8389 158.8122 165.1828 159.6931 165.9156 292.7987   100  a 
 res2 311.1658 324.5684 350.3006 331.4310 343.6755 815.8397   100   b

2 answers:

Answer 0: (score: 4)

A possible data.table solution

dft[dft, .(id, id2 = x.id),       # get the desired columns
    on = .(id > id, time, ref),   # the join condition
    nomatch = 0L,                 # remove unmatched records (NAs)
    allow.cartesian = TRUE        # in case of a big join, allow Cartesian join
    ][, .N, by = .(id, id2)]      # count obs. per ids combination
#    id id2 N
# 1:  1   2 4
# 2:  3   4 1
# 3:  3   5 1
# 4:  4   5 1

Explanation

We do a self join on time and ref, while specifying id > id so that we don't join a row to the same id, and extract the joined ids (id and x.id are the joined ids from the two data sets) while removing all unmatched rows (nomatch = 0L). Finally, we count the matched combinations (.N is a special symbol in data.table that stores the number of observations per combination).
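The question also asks for a configurable minimum number of shared timestamps. A minimal sketch, assuming the pair counts from the join above and a hypothetical threshold variable `min_sync` (not part of the original answer), could simply filter on `N`. Note that this counts all shared timestamps, not only consecutive ones.

```r
library(data.table)

# Dummy data from the question (integer time is enough for the join)
dft <- data.table(
  id = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref = c(10,11,11,11,11,
          10,11,11,11,21,
          20,31,31,31,31,
          20,41,41,41,41,
          20,51,51,51,51)
)

# Count shared (time, ref) combinations per pair of ids
pairs <- dft[dft, .(id, id2 = x.id),
             on = .(id > id, time, ref),
             nomatch = 0L, allow.cartesian = TRUE
             ][, .N, by = .(id, id2)]

min_sync <- 4L                    # hypothetical threshold, chosen for illustration
synced <- pairs[N >= min_sync]    # keep pairs synced on at least min_sync timestamps
synced
```

With `min_sync <- 5L` the same filter returns an empty table for this data, which matches the behaviour described in the question.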


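Old (and more involved) solution

# the expression benchmarked as res2 in the question
dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian = TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]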
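The older variant, benchmarked as res2 in the question, joins on time and ref only. Every row therefore also matches itself, and each pair of ids appears in both orders. A small sketch on the question's dummy data (the intermediate names `both`, `raw` and `res` are introduced here for illustration) shows why the self-matches are dropped with `V1 != V2` and the count is halved:

```r
library(data.table)

# Dummy data from the question
dft <- data.table(
  id = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref = c(10,11,11,11,11,
          10,11,11,11,21,
          20,31,31,31,31,
          20,41,41,41,41,
          20,51,51,51,51)
)

# Symmetric self join: no condition on id, so (1, 2) and (2, 1) both appear,
# as do self-matches such as (1, 1); pmin/pmax put each pair in a fixed order
both <- dft[dft, .(V1 = pmin(id, i.id), V2 = pmax(id, i.id)),
            on = .(time, ref), allow.cartesian = TRUE]

# Without halving, every pair is counted twice
raw <- both[V1 != V2, .N, by = .(id1 = V1, id2 = V2)]

# Halving restores the number of shared timestamps
res <- both[V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
res
```

The non-equi join with `id > id` avoids producing the duplicate and self-matched rows in the first place, which is consistent with res1 being faster in the benchmark.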

Answer 1: (score: 0)

Translating @David Arenburg's code into SQL gives me:

SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id 
GROUP BY a.id, b.id
ORDER BY count(*) DESC; 

And to select only those with a count > 1:

SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id 
GROUP BY a.id, b.id HAVING count(*) > 1
ORDER BY count(*) DESC; 

Code to produce the SQL table from the question's data frame (dft):

R:

fwrite(x = dft, file = "C:/testdata.csv", row.names = FALSE)

SQL:

CREATE TABLE testdata (
  id serial NOT NULL,
  timest timestamp,
  ref integer
  );

-- HEADER skips the column-name row that fwrite writes by default
COPY testdata(id, timest, ref)
FROM 'C:/testdata.csv' DELIMITER ',' CSV HEADER;