I am trying to find synchronized data entries, i.e. entries that share a certain value ("ref") over a certain number of timestamps.
Dummy data:
library(data.table)
dft <- data.table(
id = rep(1:5, each=5),
time = rep(1:5, 5),
ref = c(10,11,11,11,11,
10,11,11,11,21,
20,31,31,31,31,
20,41,41,41,41,
20,51,51,51,51)
)
setorder(dft, time)
dft[, time := as.POSIXct(time, origin = "2018-10-14")]
dft
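In this example, IDs 1 and 2 are synchronized over 4 timestamps (rows 1, 2, 6, 7, 11, 12, 16 and 17), because they share the same ref value at each of them (those rows are marked with ! in the printed table). Note: they share one ref value at the first timestamp and a different ref value at the following ones.
How can I solve this? I also want to be able to set the number of timestamps over which the values must match. If I require at least 5 synchronized timestamps, no IDs from this example should be returned. If the threshold is 4 or lower, IDs 1 & 2 should be reported as synchronized data entries.
I have to run this calculation on several million rows, so I would prefer a data.table or dplyr solution, or any other high-performance solution (SQL would be fine too).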
id time ref
1: 1 2018-10-14 02:00:01 10 !
2: 2 2018-10-14 02:00:01 10 !
3: 3 2018-10-14 02:00:01 20
4: 4 2018-10-14 02:00:01 20
5: 5 2018-10-14 02:00:01 20
6: 1 2018-10-14 02:00:02 11 !
7: 2 2018-10-14 02:00:02 11 !
8: 3 2018-10-14 02:00:02 31
9: 4 2018-10-14 02:00:02 41
10: 5 2018-10-14 02:00:02 51
11: 1 2018-10-14 02:00:03 11 !
12: 2 2018-10-14 02:00:03 11 !
13: 3 2018-10-14 02:00:03 31
14: 4 2018-10-14 02:00:03 41
15: 5 2018-10-14 02:00:03 51
16: 1 2018-10-14 02:00:04 11 !
17: 2 2018-10-14 02:00:04 11 !
18: 3 2018-10-14 02:00:04 31
19: 4 2018-10-14 02:00:04 41
20: 5 2018-10-14 02:00:04 51
21: 1 2018-10-14 02:00:05 11
22: 2 2018-10-14 02:00:05 21
23: 3 2018-10-14 02:00:05 31
24: 4 2018-10-14 02:00:05 41
25: 5 2018-10-14 02:00:05 51
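To make the marked rows concrete, here is a minimal sketch that counts, for a single pair of ids, the timestamps at which both carry the same ref. The helper name count_synced is hypothetical and not part of the question, and time is kept as a plain integer for brevity. It is only meant to pin down the definition, not to scale to millions of rows.

```r
library(data.table)

# Same dummy data as in the question (time left as an integer).
dft <- data.table(
  id   = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref  = c(10, 11, 11, 11, 11,
           10, 11, 11, 11, 21,
           20, 31, 31, 31, 31,
           20, 41, 41, 41, 41,
           20, 51, 51, 51, 51)
)

# Hypothetical helper: number of timestamps at which two ids share the same ref.
count_synced <- function(dt, id_a, id_b) {
  a <- dt[id == id_a, .(time, ref)]
  b <- dt[id == id_b, .(time, ref)]
  # An inner join on both columns keeps only timestamps where ref agrees.
  nrow(merge(a, b, by = c("time", "ref")))
}

count_synced(dft, 1, 2)  # ids 1 and 2 agree at times 1 to 4
count_synced(dft, 3, 4)  # ids 3 and 4 agree at time 1 only
```

With a threshold of 4, only the first pair would qualify; with a threshold of 5, neither would.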
Benchmark of the two examples from @DavidArenburg:
library(microbenchmark)
mc = microbenchmark(times = 100,
res1 = dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref), nomatch = 0L, allow.cartesian=TRUE][, .N, by = .(id, id2)],
res2= dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian=TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
)
mc
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 res1 156.8389 158.8122 165.1828 159.6931 165.9156 292.7987   100  a
 res2 311.1658 324.5684 350.3006 331.4310 343.6755 815.8397   100   b
Answer 0 (score: 4)
A possible data.table solution
dft[dft, .(id, id2 = x.id),        # get the desired columns
    on = .(id > id, time, ref),    # the join condition
    nomatch = 0L,                  # remove unmatched records (NAs)
    allow.cartesian = TRUE         # in case of a big join, allow a Cartesian join
  ][, .N, by = .(id, id2)]         # count observations per id combination
#    id id2 N
# 1:  1   2 4
# 2:  3   4 1
# 3:  3   5 1
# 4:  4   5 1
Explanation
We self-join on time and ref while specifying id > id, so that a row is never joined to the same id, and extract the joined IDs (id and x.id are the IDs from the two sides of the join) while removing all unmatched rows (nomatch = 0L). Finally, we count the matched combinations (.N is a special symbol in data.table that holds the number of rows in each group).
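The per-pair counts still need the minimum-timestamp threshold that the question asks for. A minimal sketch, assuming the threshold is held in a hypothetical variable min_sync (not part of the original answer), and with time kept as a plain integer for brevity:

```r
library(data.table)

# Same dummy data as in the question (time left as an integer).
dft <- data.table(
  id   = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref  = c(10, 11, 11, 11, 11,
           10, 11, 11, 11, 21,
           20, 31, 31, 31, 31,
           20, 41, 41, 41, 41,
           20, 51, 51, 51, 51)
)

min_sync <- 4L  # hypothetical threshold: minimum number of shared timestamps

# Count shared (time, ref) combinations per pair of ids, as in the answer.
pair_counts <- dft[dft, .(id, id2 = x.id),
                   on = .(id > id, time, ref),
                   nomatch = 0L, allow.cartesian = TRUE
                 ][, .N, by = .(id, id2)]

# Keep only the pairs that reach the threshold.
synced_pairs <- pair_counts[N >= min_sync]
synced_pairs
```

With min_sync set to 4 this should leave only the pair of ids 1 and 2; raising it to 5 should leave no rows, which matches the behaviour described in the question.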
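Old (and more involved) solution (benchmarked as res2 in the question)
dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian = TRUE
  ][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]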
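The benchmark in the question times two expressions, res1 and res2. As a sanity check, the following sketch runs both on the dummy data and compares the pair counts after normalising column names, types and row order. The normalisation step (the lo, hi and n columns) is my own addition and is not taken from the answer; time is kept as a plain integer for brevity.

```r
library(data.table)

# Same dummy data as in the question (time left as an integer).
dft <- data.table(
  id   = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref  = c(10, 11, 11, 11, 11,
           10, 11, 11, 11, 21,
           20, 31, 31, 31, 31,
           20, 41, 41, 41, 41,
           20, 51, 51, 51, 51)
)

# Non-equi self-join: each unordered pair of ids is produced once.
res1 <- dft[dft, .(id, id2 = x.id),
            on = .(id > id, time, ref),
            nomatch = 0L, allow.cartesian = TRUE
          ][, .N, by = .(id, id2)]

# Equi self-join: each pair is produced twice (plus self-matches),
# hence the V1 != V2 filter and the division by 2.
res2 <- dft[dft, .(pmin(id, i.id), pmax(id, i.id)),
            on = .(time, ref), allow.cartesian = TRUE
          ][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]

# Bring both results to the same shape before comparing.
a <- res1[, .(lo = pmin(id, id2), hi = pmax(id, id2), n = as.numeric(N))][order(lo, hi)]
b <- res2[, .(lo = id1, hi = id2, n = as.numeric(synced))][order(lo, hi)]

isTRUE(all.equal(a, b, check.attributes = FALSE))
```

The equi-join variant builds roughly twice as many intermediate rows, which is consistent with it being the slower of the two in the benchmark.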
Answer 1 (score: 0)
Translating @David Arenburg's code into SQL gives me:
SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id
GROUP BY a.id, b.id
ORDER BY count(*) DESC;
And to select only the pairs with a count > 1:
SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id
GROUP BY a.id, b.id HAVING count(*) > 1
ORDER BY count(*) DESC;
Code to create the SQL table from the question's data frame (dft):
R:
fwrite(x = dft, file = "C:/testdata.csv", row.names = FALSE)
SQL:
CREATE TABLE testdata (
id serial NOT NULL,
timest timestamp,
ref integer
);
-- HEADER skips the column-name line that fwrite writes by default
COPY testdata(id, timest, ref)
FROM 'C:/testdata.csv' DELIMITER ',' CSV HEADER;