Question

I have two tables, t1 and t2, each defining a multimap from id to word:

> select * from t1;
id word
1  foo
1  bar
2  baz
2  quux

and

> select * from t2;
id word
1  foo
1  baz
3  baz

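What I want is to find, for each id, the sizes of the union and intersection of the sets of words:

id  t1_union_t2 t1 t2 t2_minus_t1  t1_minus_t2 t1_intersect_t2
1   3           2  2  1            1           1
2   2           2  0  0            2           0
3   1           0  1  1            0           0

Obviously, the columns are not independent, e.g., t1_union_t2 = t1 + t2_minus_t1.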

I want all of them only as a consistency check.

Answer 1

Here is how I would approach this in SQL:

-- Outer query: count the (id, word) pairs for each combination of occurrence counts.
select numtable1, numtable2, count(*) as numwords, min(id) as minid, max(id) as maxid
from (select id, word, sum(istable1) as numtable1, sum(istable2) as numtable2
      -- Tag every row with the table it came from (t1 and t2 as named in the question).
      from ((select id, word, 1 as istable1, 0 as istable2
             from t1
            ) union all
            (select id, word, 0 as istable1, 1 as istable2
             from t2
            )
           ) u
      group by id, word
     ) t
group by numtable1, numtable2;

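This identifies duplicates both within each table and between the two: each output row counts the (id, word) pairs that occur numtable1 times in t1 and numtable2 times in t2.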

Hive supports subqueries in the from clause, so this will probably work in Hive as well.
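
Note that the query above reports totals across all ids rather than one row per id. The following is a rough sketch of how the same union all idea might be adapted to produce the per-id columns from the question. It is not part of the original answer: the aliases tagged and pairs, the columns in_t1 and in_t2, the Python function per_id_set_sizes, and the in-memory SQLite database are all assumptions made for illustration. It uses MAX instead of SUM so that a pair duplicated within one table still counts once.

```python
import sqlite3

# Sample data copied from the question.
T1_ROWS = [(1, "foo"), (1, "bar"), (2, "baz"), (2, "quux")]
T2_ROWS = [(1, "foo"), (1, "baz"), (3, "baz")]

# Hypothetical per-id variant of the UNION ALL approach (no FULL JOIN needed).
PER_ID_QUERY = """
SELECT id,
       COUNT(*)   AS t1_union_t2,
       SUM(in_t1) AS t1,
       SUM(in_t2) AS t2,
       SUM(CASE WHEN in_t1 = 0 THEN 1 ELSE 0 END) AS t2_minus_t1,
       SUM(CASE WHEN in_t2 = 0 THEN 1 ELSE 0 END) AS t1_minus_t2,
       SUM(CASE WHEN in_t1 = 1 AND in_t2 = 1 THEN 1 ELSE 0 END) AS t1_intersect_t2
FROM (SELECT id, word,
             MAX(is_t1) AS in_t1,  -- 1 if the (id, word) pair occurs in t1
             MAX(is_t2) AS in_t2   -- 1 if the (id, word) pair occurs in t2
      FROM (SELECT id, word, 1 AS is_t1, 0 AS is_t2 FROM t1
            UNION ALL
            SELECT id, word, 0 AS is_t1, 1 AS is_t2 FROM t2) AS tagged
      GROUP BY id, word) AS pairs
GROUP BY id
ORDER BY id
"""


def per_id_set_sizes(t1_rows, t2_rows):
    """Load both tables into an in-memory SQLite database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t1 (id INTEGER, word TEXT)")
        conn.execute("CREATE TABLE t2 (id INTEGER, word TEXT)")
        conn.executemany("INSERT INTO t1 VALUES (?, ?)", t1_rows)
        conn.executemany("INSERT INTO t2 VALUES (?, ?)", t2_rows)
        return conn.execute(PER_ID_QUERY).fetchall()
    finally:
        conn.close()


rows = per_id_set_sizes(T1_ROWS, T2_ROWS)
for row in rows:
    print(row)
```

On the sample data this should reproduce the three rows of the table in the question. Whether the same statement runs unchanged on Hive has not been checked here.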

Answer 2

One way to do this is with a FULL JOIN:

SELECT COALESCE(t1.id, t2.id) id,
       COUNT(*) t1_union_t2,
       COUNT(t1.id) t1,
       COUNT(t2.id) t2,
       SUM(CASE WHEN t1.id IS NULL THEN 1 ELSE 0 END) t2_minus_t1,
       SUM(CASE WHEN t2.id IS NULL THEN 1 ELSE 0 END) t1_minus_t2,
       SUM(CASE WHEN t1.id = t2.id THEN 1 ELSE 0 END) t1_intersect_t2
  FROM t1 FULL JOIN t2
    ON t1.id = t2.id
   AND t1.word = t2.word
 GROUP BY COALESCE(t1.id, t2.id);

Output:

| ID | T1_UNION_T2 | T1 | T2 | T2_MINUS_T1 | T1_MINUS_T2 | T1_INTERSECT_T2 |
|----|-------------|----|----|-------------|-------------|-----------------|
|  1 |           3 |  2 |  2 |           1 |           1 |               1 |
|  2 |           2 |  2 |  0 |           0 |           2 |               0 |
|  3 |           1 |  0 |  1 |           1 |           0 |               0 |

Here is a SQLFiddle demo.
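
Since the question asks for these numbers as a consistency check, the expected values can also be cross-checked without any SQL engine. The sketch below recomputes the same six columns with plain Python sets from the sample rows in the question. The helper names words_by_id and set_sizes are hypothetical, and this is only an illustration, not part of the original answer.

```python
# Sample data copied from the question.
T1_ROWS = [(1, "foo"), (1, "bar"), (2, "baz"), (2, "quux")]
T2_ROWS = [(1, "foo"), (1, "baz"), (3, "baz")]


def words_by_id(rows):
    """Group (id, word) rows into a dict mapping id -> set of words."""
    grouped = {}
    for row_id, word in rows:
        grouped.setdefault(row_id, set()).add(word)
    return grouped


def set_sizes(t1_rows, t2_rows):
    """Return {id: {column name: size}} for every id seen in either table."""
    w1, w2 = words_by_id(t1_rows), words_by_id(t2_rows)
    sizes = {}
    for row_id in sorted(w1.keys() | w2.keys()):
        a, b = w1.get(row_id, set()), w2.get(row_id, set())
        sizes[row_id] = {
            "t1_union_t2": len(a | b),
            "t1": len(a),
            "t2": len(b),
            "t2_minus_t1": len(b - a),
            "t1_minus_t2": len(a - b),
            "t1_intersect_t2": len(a & b),
        }
    return sizes


sizes = set_sizes(T1_ROWS, T2_ROWS)
for row_id, s in sizes.items():
    # The columns are not independent, so these identities must hold.
    assert s["t1_union_t2"] == s["t1"] + s["t2_minus_t1"]
    assert s["t1_union_t2"] == s["t2"] + s["t1_minus_t2"]
    assert s["t1_intersect_t2"] == s["t1"] - s["t1_minus_t2"]
    print(row_id, s)
```

Any query result that disagrees with these per-id values, or violates the identities in the loop, points to a bug in the query or to duplicate rows in the input.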

Set intersection/difference in SQL?
