Hive: Counting matches from an INNER JOIN on the unique values of a column

Time: 2017-03-03 10:53:34

Tags: sql hive hiveql

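I'm trying to count the matches between columns produced by an INNER JOIN of two tables, computed over the unique values of a column in one of the tables. An example should make this clearer.

If I have the following two tables:

    Table A
    -------
    id_A: info_A
    1     'a'
    2     'b'
    3     'c'
    3     'd'

    Table B
    -------
    id_B: info_B
    1     'a'
    3     'c'
    5     'b'

I want to find the unique id_A values [1,2,3] and the info_A values associated with them: ['a','b','c','d'].

I want to create a table like the following:

    Table join of A+B
    -----------------
    id_A: info_A    id_B  info_B  match_cnt
    1     'a'       1     'a'     1
    3     'c','d'   3     'c'     0.5

where match_cnt is the number of matches between info_A and info_B for a given id_A. FYI, the actual tables I'm working with have billions of rows.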
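To pin down the intended result on the sample data, here is a minimal Python sketch (not Hive code, and not meant for billions of rows). It assumes match_cnt is the number of distinct info values shared by both tables for an id, divided by the number of distinct info_A values for that id; that reading reproduces the 1 and 0.5 in the example, but the question does not spell out the denominator. The names `table_a`, `table_b`, `group_info`, and `match_counts` are made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question, as (id, info) pairs.
table_a = [(1, "a"), (2, "b"), (3, "c"), (3, "d")]
table_b = [(1, "a"), (3, "c"), (5, "b")]


def group_info(rows):
    """Collect the distinct info values for each id."""
    grouped = defaultdict(set)
    for row_id, info in rows:
        grouped[row_id].add(info)
    return grouped


def match_counts(rows_a, rows_b):
    """Return {id: (info_A, info_B, match_cnt)} for ids present in both tables.

    Assumption: match_cnt = |info_A & info_B| / |info_A|.
    """
    info_a = group_info(rows_a)
    info_b = group_info(rows_b)
    result = {}
    # Intersecting the key sets plays the role of the inner join on id.
    for row_id in sorted(info_a.keys() & info_b.keys()):
        a, b = info_a[row_id], info_b[row_id]
        result[row_id] = (sorted(a), sorted(b), len(a & b) / len(a))
    return result


for row_id, (a, b, cnt) in match_counts(table_a, table_b).items():
    print(row_id, a, b, cnt)
```

Ids 2 and 5 drop out because each appears in only one of the two tables.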

The code block shows what I have tried, along with variations (not shown below):

[code block not preserved]

2 Answers:

Answer 0 (score: 0)

select      id
           ,collect_list (case when a=1 then info end)                  as info_a
           ,collect_list (case when b=1 then info end)                  as info_b
           ,count        (case when a=1 and b=1 then 1 end) / count(*)  as match_cnt

from       (select      id
                       ,info
                       ,min (case when tab = 'A' then 1 end)    as a
                       ,min (case when tab = 'B' then 1 end)    as b

            from        (           select 'A' as tab ,id_A as id ,info_A as info from A
                        union all   select 'B' as tab ,id_B as id ,info_B as info from B
                        ) t

            group by    id
                       ,info
            ) t

group by    id

having      min(a) = 1
        and min(b) = 1
;
+----+-----------+--------+-----------+
| id |  info_a   | info_b | match_cnt |
+----+-----------+--------+-----------+
|  1 | ["a"]     | ["a"]  | 1.0       |
|  3 | ["c","d"] | ["c"]  | 0.5       |
+----+-----------+--------+-----------+
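
The query above works in two stages: the inner GROUP BY id, info flags whether each (id, info) pair occurs in A, in B, or in both, and the outer GROUP BY id counts the pairs flagged in both and divides by all pairs for that id. The following is a rough Python trace of those steps, assuming the tables hold exactly the sample rows from the question; the function name `trace_query` is hypothetical and the code only mirrors the query's logic, it is not how Hive executes it.

```python
from collections import defaultdict

# Sample rows from the question, as (id, info) pairs.
table_a = [(1, "a"), (2, "b"), (3, "c"), (3, "d")]
table_b = [(1, "a"), (3, "c"), (5, "b")]


def trace_query(rows_a, rows_b):
    # UNION ALL: tag every row with its source table.
    unioned = [("A", i, info) for i, info in rows_a]
    unioned += [("B", i, info) for i, info in rows_b]

    # Inner GROUP BY id, info: which tables does each pair occur in?
    flags = defaultdict(set)
    for tab, row_id, info in unioned:
        flags[(row_id, info)].add(tab)

    # Outer GROUP BY id: gather the flagged pairs per id.
    per_id = defaultdict(list)
    for (row_id, info), tabs in sorted(flags.items()):
        per_id[row_id].append((info, tabs))

    rows = []
    for row_id, pairs in per_id.items():
        info_a = [info for info, tabs in pairs if "A" in tabs]
        info_b = [info for info, tabs in pairs if "B" in tabs]
        # HAVING min(a) = 1 and min(b) = 1: the id must occur in both tables.
        if not info_a or not info_b:
            continue
        matched = sum(1 for _, tabs in pairs if tabs == {"A", "B"})
        # count(*) counts every distinct (id, info) pair from A or B.
        rows.append((row_id, info_a, info_b, matched / len(pairs)))
    return rows


for row in trace_query(table_a, table_b):
    print(row)
```

Note that the denominator here is the number of distinct info values found in either table for the id, so an info value that appears only in B would lower match_cnt as well.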

Answer 1 (score: 0)

You can use the following:

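 WITH T1 AS (
     -- number of distinct info values per id that occur in both tables
     select tableA.ID_A, count(distinct tableA.INFO_A) as cnt
     from tableA
     inner join tableB
         on tableA.ID_A = tableB.ID_B and tableA.INFO_A = tableB.INFO_B
     group by tableA.ID_A
 )
 select tmp.ID_A, tmp.a, tmp.ID_B, tmp.b, (T1.cnt / size(tmp.a)) as match_cnt
 from (
     -- distinct info values per id, for ids present in both tables
     select tableA.ID_A, collect_set(tableA.INFO_A) as a,
            tableB.ID_B, collect_set(tableB.INFO_B) as b
     from tableA
     inner join tableB on tableA.ID_A = tableB.ID_B
     group by tableA.ID_A, tableB.ID_B
 ) tmp
 -- inner join: ids with no matching info value at all are dropped
 join T1 on T1.ID_A = tmp.ID_A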