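I am trying to compute the matches between columns produced by an INNER JOIN of two tables on the unique values of one column. An example should make this clearer.
If I have the following two tables:
Table A
-------
id_A: info_A
1 'a'
2 'b'
3 'c'
3 'd'
Table B
-------
id_B: info_B
1 'a'
3 'c'
5 'b'
I want to find the unique id_A values, [1,2,3], and the info_A values associated with them, ['a','b','c','d'].
I want to create a table that looks like this: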
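Table join of A+B
-----------------
id_A: info_A id_B info_B match_cnt
1 'a' 1 'a' 1
3 'c','d' 3 'c' 0.5
where match_cnt is the number of matches between info_A and info_B for a given id_A, shown here as a fraction (0.5 for id 3 because only 'c' of 'c','d' also appears in Table B). FYI, the actual tables I am working with have billions of rows.
The code block shows what I have tried, along with variations (not shown below):
{{1}}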
Answer 0: (score: 0)
select id
,collect_list (case when a=1 then info end) as info_a
,collect_list (case when b=1 then info end) as info_b
,count (case when a=1 and b=1 then 1 end) / count(*) as match_cnt
from (select id
,info
,min (case when tab = 'A' then 1 end) as a
,min (case when tab = 'B' then 1 end) as b
from ( select 'A' as tab ,id_A as id ,info_A as info from A
union all select 'B' as tab ,id_B as id ,info_B as info from B
) t
group by id
,info
) t
group by id
having min(a) = 1
and min(b) = 1
;
+----+-----------+--------+-----------+
| id | info_a | info_b | match_cnt |
+----+-----------+--------+-----------+
| 1 | ["a"] | ["a"] | 1.0 |
| 3 | ["c","d"] | ["c"] | 0.5 |
+----+-----------+--------+-----------+
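As a cross-check of the query above on the sample data, here is a minimal Python sketch of the same logic: collect the distinct info values per id from each table, keep only ids present in both, and divide the number of info values found in both tables by the number of distinct info values seen for that id. The function name `match_table` and the in-memory lists are assumptions for illustration only; this is a way to reason about the expected result, not something suited to tables with billions of rows.

```python
from collections import defaultdict


def match_table(rows_a, rows_b):
    """Mirror the union-all / group-by query on small in-memory data.

    rows_a, rows_b: iterables of (id, info) tuples, hypothetical stand-ins
    for tables A and B.
    Returns {id: (sorted info_a, sorted info_b, match_cnt)}.
    """
    info_a = defaultdict(set)
    info_b = defaultdict(set)
    for id_, info in rows_a:
        info_a[id_].add(info)
    for id_, info in rows_b:
        info_b[id_].add(info)

    result = {}
    # Only ids present in both tables survive (the HAVING clause).
    for id_ in sorted(info_a.keys() & info_b.keys()):
        a, b = info_a[id_], info_b[id_]
        # Values found in both tables over all distinct values for this id.
        match_cnt = len(a & b) / len(a | b)
        result[id_] = (sorted(a), sorted(b), match_cnt)
    return result


# Sample data from the question.
table_a = [(1, "a"), (2, "b"), (3, "c"), (3, "d")]
table_b = [(1, "a"), (3, "c"), (5, "b")]

for id_, (a, b, cnt) in match_table(table_a, table_b).items():
    print(id_, a, b, cnt)
```

Note that the denominator is the number of distinct info values across both tables for an id, which matches `count(*)` in the outer query; if match_cnt should instead be relative to info_A only, the denominator would be `len(a)`.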
Answer 1: (score: 0)
You can use the following:
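-- T1: per id, the number of distinct info values present in both tables
WITH T1 AS (
  select tableA.ID_A, count(distinct tableA.INFO_A) as cnt
  from tableA inner join tableB
    on tableA.ID_A = tableB.ID_B and tableA.INFO_A = tableB.INFO_B
  group by tableA.ID_A
)
select tmp.ID_A, tmp.a, tmp.ID_B, tmp.b, (T1.cnt / size(tmp.a)) as match_cnt
from (
  -- collect the distinct info values from each side, joined on id
  select tableA.ID_A, collect_set(tableA.INFO_A) as a, tableB.ID_B, collect_set(tableB.INFO_B) as b
  from tableA inner join tableB on tableA.ID_A = tableB.ID_B
  group by tableA.ID_A, tableB.ID_B
) tmp
join T1 on T1.ID_A = tmp.ID_A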