Question

I have the following (simplified) table:

structure_id | hash_id
1              1
1              2
1              3
2              4
2              5
2              1
3              6
3              1
3              4

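I want to get the intersection of shared hash ids between structures. For the sample data above, that means:

Structure ID 1 shares 3 records with itself, structures 1 and 2 share 1 record, and so on, so the SQL result would be: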

id | intersected_id | count
1    1                3
1    2                1
1    3                1
2    1                1
2    2                3
2    3                2
3    1                1
3    2                2
3    3                3

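It is worth mentioning that the table has about 500 million records, so the query must be as optimized as possible. How can I do this?

What I have tried so far is a self-join: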

SELECT t1.structure_id, COUNT(t1.hash_id) FROM table t1 INNER JOIN table t2 ON t1.structure_id != t2.structure_id AND t1.hash_id = t2.hash_id GROUP BY t1.structure_id;

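But it does not work correctly: for each structure ID it counts the rows duplicated across all other structure IDs combined, rather than per pair of structures.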

Answer 1

You can do this with a self-join:

select t1.structure_id, t2.structure_id, count(*)
from test t1 join
     test t2
     on t1.hash_id = t2.hash_id
group by t1.structure_id, t2.structure_id;
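
To check this query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the `rows` list, and the index name `idx_test_hash` are assumptions made for illustration. The real table lives in a different engine and is far larger, so this only verifies the logic, not the performance.

```python
import sqlite3

# Sample rows from the question: (structure_id, hash_id).
rows = [(1, 1), (1, 2), (1, 3),
        (2, 4), (2, 5), (2, 1),
        (3, 6), (3, 1), (3, 4)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (structure_id INTEGER, hash_id INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)", rows)

# Assumption: an index that leads with hash_id should help the join;
# the index name is hypothetical.
conn.execute("CREATE INDEX idx_test_hash ON test (hash_id, structure_id)")

# Same self-join as above, with an ORDER BY added for stable output.
query = """
SELECT t1.structure_id, t2.structure_id, COUNT(*)
FROM test t1
JOIN test t2 ON t1.hash_id = t2.hash_id
GROUP BY t1.structure_id, t2.structure_id
ORDER BY t1.structure_id, t2.structure_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On the sample data this reproduces the nine rows listed in the question. COUNT(*) gives the right numbers here because each (structure_id, hash_id) pair occurs only once; if a pair can repeat in the real table, the counts would be inflated.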

Answer 2

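This works, but I doubt it will be fast enough for your needs. As I said in the comments, an imperative program might be better suited to this problem.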

SELECT id 
      ,intersected_id 
      ,COUNT(DISTINCT hash_id) AS `count`
FROM (
  SELECT t1.structure_id AS id 
        ,t2.structure_id AS intersected_id 
        ,t1.hash_id 
  FROM test AS t1
  INNER JOIN test AS t2
  ON t1.hash_id = t2.hash_id
) derived
GROUP BY id, intersected_id

SQL Fiddle
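
As a rough idea of what the imperative alternative mentioned above could look like, here is a minimal Python sketch. It groups structure ids by hash id and then counts every pair of structures that share a hash. The function name `count_intersections` and the in-memory `rows` list are assumptions for illustration; with about 500 million rows the data would have to be streamed or partitioned rather than held in one list.

```python
from collections import Counter, defaultdict
from itertools import product

# Sample rows from the question: (structure_id, hash_id).
rows = [(1, 1), (1, 2), (1, 3),
        (2, 4), (2, 5), (2, 1),
        (3, 6), (3, 1), (3, 4)]


def count_intersections(rows):
    """Count shared hash ids for every ordered pair of structure ids."""
    # Collect the distinct structures that contain each hash id.
    # Using a set mirrors COUNT(DISTINCT hash_id) in the query above.
    structures_by_hash = defaultdict(set)
    for structure_id, hash_id in rows:
        structures_by_hash[hash_id].add(structure_id)

    # Every hash shared by structures a and b adds one to the (a, b) count,
    # including the pair (a, a).
    counts = Counter()
    for structures in structures_by_hash.values():
        for pair in product(structures, repeat=2):
            counts[pair] += 1
    return counts


counts = count_intersections(rows)
for (structure_id, intersected_id), n in sorted(counts.items()):
    print(structure_id, intersected_id, n)
```

Like the inner join, this only produces pairs that share at least one hash id; pairs with no overlap are simply absent rather than reported with a count of 0.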
