I have two tables, t1 and t2, each of which defines a multimap from id to word:
> select * from t1;
id word
1 foo
1 bar
2 baz
2 quux
and
> select * from t2;
id word
1 foo
1 baz
3 baz
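What I want is to find, for each id, the sizes of the union and intersection of its word sets in the two tables:
id t1_union_t2 t1 t2 t2_minus_t1 t1_minus_t2 t1_intersect_t2
1 3 2 2 1 1 1
2 2 2 0 0 2 0
3 1 0 1 1 0 0
Obviously, the columns are not independent; I want all of them only as a consistency check.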
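To make the expected result concrete, here is a minimal reference sketch in plain Python that computes the same columns with set operations. It is not part of the original question: the helper names `words_by_id` and `set_sizes` are hypothetical, and the rows are simply copied from the sample tables above. It may serve as a small-scale cross-check for whatever SQL is eventually used.

```python
from collections import defaultdict

# Sample rows copied from the question's t1 and t2.
t1_rows = [(1, "foo"), (1, "bar"), (2, "baz"), (2, "quux")]
t2_rows = [(1, "foo"), (1, "baz"), (3, "baz")]


def words_by_id(rows):
    """Group (id, word) pairs into a dict mapping id -> set of words."""
    grouped = defaultdict(set)
    for id_, word in rows:
        grouped[id_].add(word)
    return grouped


def set_sizes(rows1, rows2):
    """Return one tuple per id, in the column order used in the question:
    (id, union, t1, t2, t2_minus_t1, t1_minus_t2, intersection)."""
    a, b = words_by_id(rows1), words_by_id(rows2)
    result = []
    for id_ in sorted(set(a) | set(b)):
        # .get() avoids creating empty entries for ids missing from a table.
        s1, s2 = a.get(id_, set()), b.get(id_, set())
        result.append((id_, len(s1 | s2), len(s1), len(s2),
                       len(s2 - s1), len(s1 - s2), len(s1 & s2)))
    return result


for row in set_sizes(t1_rows, t2_rows):
    print(*row)
```

Because the words are held in sets, duplicate (id, word) rows within one table would be counted once here.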
Answer 0 (score: 2)
Here is how I would approach this in SQL:
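-- Tag each row with its source table, collapse to one row per (id, word),
-- then count how many (id, word) pairs share each combination of counts.
-- Table names follow the question (t1, t2).
select numtable1, numtable2, count(*) as numwords, min(id) as minid, max(id) as maxid
from (select id, word, sum(istable1) as numtable1, sum(istable2) as numtable2
      from ((select id, word, 1 as istable1, 0 as istable2
             from t1
            ) union all
            (select id, word, 0 as istable1, 1 as istable2
             from t2
            )
           ) t
      group by id, word
     ) t
group by numtable1, numtable2;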
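This identifies the duplicates within each table as well as between the two tables.
Hive supports subqueries in the from clause, so this will probably work in Hive as well.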
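The query above groups by the pair of counts rather than by id. As a hedged sketch, the same UNION ALL plus GROUP BY idea can be adapted to produce the per-id columns from the question. This is a variation, not the answerer's query: it uses MAX instead of SUM so that duplicate rows within one table collapse to a 0/1 flag, and the names `is_t1`, `is_t2`, `in_t1`, `in_t2` and `PER_ID_QUERY` are made up for illustration. It runs against an in-memory SQLite database through Python's standard `sqlite3` module purely as a stand-in; Hive's syntax may differ in details.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, word TEXT);
    CREATE TABLE t2 (id INTEGER, word TEXT);
    INSERT INTO t1 VALUES (1, 'foo'), (1, 'bar'), (2, 'baz'), (2, 'quux');
    INSERT INTO t2 VALUES (1, 'foo'), (1, 'baz'), (3, 'baz');
""")

# Inner query: one row per (id, word) with 0/1 membership flags.
# Outer query: per-id counts derived from those flags.
PER_ID_QUERY = """
SELECT id,
       COUNT(*)                 AS t1_union_t2,
       SUM(in_t1)               AS t1,
       SUM(in_t2)               AS t2,
       SUM(in_t2 * (1 - in_t1)) AS t2_minus_t1,
       SUM(in_t1 * (1 - in_t2)) AS t1_minus_t2,
       SUM(in_t1 * in_t2)       AS t1_intersect_t2
FROM (SELECT id, word, MAX(is_t1) AS in_t1, MAX(is_t2) AS in_t2
      FROM (SELECT id, word, 1 AS is_t1, 0 AS is_t2 FROM t1
            UNION ALL
            SELECT id, word, 0 AS is_t1, 1 AS is_t2 FROM t2) u
      GROUP BY id, word) w
GROUP BY id
ORDER BY id
"""

rows = conn.execute(PER_ID_QUERY).fetchall()
for row in rows:
    print(*row)
```

On the sample data this yields the three rows of the desired table in the question.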
Answer 1 (score: 1)
Use a FULL JOIN:
SELECT COALESCE(t1.id, t2.id) id,
       COUNT(*) t1_union_t2,
       COUNT(t1.id) t1,
       COUNT(t2.id) t2,
       SUM(CASE WHEN t1.id IS NULL THEN 1 ELSE 0 END) t2_minus_t1,
       SUM(CASE WHEN t2.id IS NULL THEN 1 ELSE 0 END) t1_minus_t2,
       SUM(CASE WHEN t1.id = t2.id THEN 1 ELSE 0 END) t1_intersect_t2
  FROM t1 FULL JOIN t2
    ON t1.id = t2.id
   AND t1.word = t2.word
 GROUP BY COALESCE(t1.id, t2.id);
Output:
| ID | T1_UNION_T2 | T1 | T2 | T2_MINUS_T1 | T1_MINUS_T2 | T1_INTERSECT_T2 |
|----|-------------|----|----|-------------|-------------|-----------------|
|  1 |           3 |  2 |  2 |           1 |           1 |               1 |
|  2 |           2 |  2 |  0 |           0 |           2 |               0 |
|  3 |           1 |  0 |  1 |           1 |           0 |               0 |
Here is a SQLFiddle demo.