Set intersection/difference in SQL?

Time: 2014-01-27 16:52:33

Tags: sql set hive

I have two tables, t1 and t2, each defining a multimap from id to word:

> select * from t1;
id word
1  foo
1  bar
2  baz
2  quux

> select * from t2;
id word
1  foo
1  baz
3  baz

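What I want is the sizes of the union and intersection of the sets of words for each id:

id  t1_union_t2 t1 t2 t2_minus_t1  t1_minus_t2 t1_intersect_t2
1   3           2  2  1            1           1
2   2           2  0  0            2           0
3   1           0  1  1            0           0

Obviously, the columns are not independent, e.g.

t1_union_t2 = t1 + t2_minus_t1 = t2 + t1_minus_t2 = t1_intersect_t2 + t1_minus_t2 + t2_minus_t1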

I want all of them only for sanity checking.

2 answers:

Answer 0 (score: 2)

Here is how I would approach this in SQL:

select numtable1, numtable2, count(*) as numwords, min(id) as minid, max(id) as maxid
from (select id, word, sum(istable1) as numtable1, sum(istable2) as numtable2
      from ((select id, word, 1 as istable1, 0 as istable2
             from table1
            ) union all
            (select id, word, 0 as istable1, 1 as istable2
             from table2
            )
           ) t
      group by id, word
     ) t
group by numtable1, numtable2;

This identifies the duplicates within each table and between them.

Hive supports subqueries in the from clause, so this might work in Hive as well.
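
The query above groups by (numtable1, numtable2) rather than by id. To get the per-id layout from the question, the same UNION ALL plus GROUP BY idea can probably be aggregated by id instead. The following is a minimal sketch, run through Python's sqlite3 module purely so it can be tried locally; it has not been tested on Hive. The tables t1 and t2 and the output column names come from the question, while the flag names is_t1, is_t2, in_t1 and in_t2 are made up for illustration.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, word TEXT);
    CREATE TABLE t2 (id INTEGER, word TEXT);
    INSERT INTO t1 VALUES (1,'foo'),(1,'bar'),(2,'baz'),(2,'quux');
    INSERT INTO t2 VALUES (1,'foo'),(1,'baz'),(3,'baz');
""")

# Inner query: tag each row with its source table (hypothetical flags).
# Middle query: one row per (id, word); MAX collapses duplicates so the
# flags stay 0/1 and the outer counts are true set sizes.
# Outer query: count the flag combinations per id.
QUERY = """
SELECT id,
       COUNT(*)                                    AS t1_union_t2,
       SUM(in_t1)                                  AS t1,
       SUM(in_t2)                                  AS t2,
       SUM(CASE WHEN in_t1 = 0 THEN 1 ELSE 0 END)  AS t2_minus_t1,
       SUM(CASE WHEN in_t2 = 0 THEN 1 ELSE 0 END)  AS t1_minus_t2,
       SUM(CASE WHEN in_t1 = 1 AND in_t2 = 1
                THEN 1 ELSE 0 END)                 AS t1_intersect_t2
FROM (SELECT id, word, MAX(is_t1) AS in_t1, MAX(is_t2) AS in_t2
      FROM (SELECT id, word, 1 AS is_t1, 0 AS is_t2 FROM t1
            UNION ALL
            SELECT id, word, 0 AS is_t1, 1 AS is_t2 FROM t2) u
      GROUP BY id, word) g
GROUP BY id
ORDER BY id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

This variant avoids FULL JOIN, which may help on engines where outer joins are limited or slow, at the cost of one extra level of nesting.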

Answer 1 (score: 1)

One way to do this is with a FULL JOIN:

SELECT COALESCE(t1.id, t2.id) id,
       COUNT(*) t1_union_t2,
       COUNT(t1.id) t1,
       COUNT(t2.id) t2,
       SUM(CASE WHEN t1.id IS NULL THEN 1 ELSE 0 END) t2_minus_t1,
       SUM(CASE WHEN t2.id IS NULL THEN 1 ELSE 0 END) t1_minus_t2,
       SUM(CASE WHEN t1.id = t2.id THEN 1 ELSE 0 END) t1_intersect_t2
  FROM t1 FULL JOIN t2
    ON t1.id = t2.id
   AND t1.word = t2.word
 GROUP BY COALESCE(t1.id, t2.id);

Output:

| ID | T1_UNION_T2 | T1 | T2 | T2_MINUS_T1 | T1_MINUS_T2 | T1_INTERSECT_T2 |
|----|-------------|----|----|-------------|-------------|-----------------|
|  1 |           3 |  2 |  2 |           1 |           1 |               1 |
|  2 |           2 |  2 |  0 |           0 |           2 |               0 |
|  3 |           1 |  0 |  1 |           1 |           0 |               0 |

Here is a SQLFiddle demo.
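
Since the goal is sanity checking, the expected counts can also be cross-checked outside SQL. The sketch below uses plain Python sets, with the sample rows from the question hard-coded. The helper names words_by_id and set_sizes are hypothetical and not part of any library.

```python
from collections import defaultdict

# Sample rows from the question, as (id, word) pairs.
t1 = [(1, "foo"), (1, "bar"), (2, "baz"), (2, "quux")]
t2 = [(1, "foo"), (1, "baz"), (3, "baz")]

def words_by_id(rows):
    """Group rows into a mapping id -> set of words."""
    grouped = defaultdict(set)
    for id_, word in rows:
        grouped[id_].add(word)
    return grouped

def set_sizes(rows1, rows2):
    """Return id -> (union, t1, t2, t2_minus_t1, t1_minus_t2, intersect)."""
    a, b = words_by_id(rows1), words_by_id(rows2)
    result = {}
    for id_ in sorted(a.keys() | b.keys()):
        s1, s2 = a.get(id_, set()), b.get(id_, set())
        result[id_] = (len(s1 | s2), len(s1), len(s2),
                       len(s2 - s1), len(s1 - s2), len(s1 & s2))
    return result

sizes = set_sizes(t1, t2)
for id_, (union, n1, n2, only2, only1, both) in sizes.items():
    # The columns are not independent; these identities must hold.
    assert union == n1 + only2 == n2 + only1 == both + only1 + only2
    print(id_, union, n1, n2, only2, only1, both)
```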