How can I use Hive to query 3 large tables for intersecting values?

Time: 2017-07-28 18:24:26

Tags: sql hadoop apache-spark hive

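I have 3 very large tables of IP addresses*, and I am trying to count the number of IPs common to all 3 tables. I have considered using joins and subqueries to find the intersection of IPs between these 3 tables. How can I find the intersection of all 3 tables with a single query?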

This is not valid syntax, but it illustrates what I am trying to accomplish:

SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip

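I have seen other answers on how to join 3 tables, but nothing for Hive and nothing at this scale.

*Notes:

  • Table a: 700 million rows
  • Table b: 1.8 billion rows
  • Table c: 1.68 million rows
  • The 'tables' are actually Hive Metastore tables backed by S3.
  • Each table contains many duplicate IPs.
  • Performance suggestions are welcome.
  • Running the query in Spark SQL instead of Hive is also an option.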

2 answers:

Answer 0: (score: 3)

The correct syntax is:

SELECT COUNT(DISTINCT a.ip)
FROM a JOIN
     b
     ON a.ip = b.ip JOIN
     c
     ON a.ip  = c.ip;

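With many duplicate IPs in each table, the join produces one row for every matching combination before COUNT(DISTINCT) collapses them, so this probably will not finish in our lifetimes. A better approach deduplicates each table first: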

select ip
from (select distinct a.ip, 1 as which from a union all
      select distinct b.ip, 2 as which from b union all
      select distinct c.ip, 3 as which from c
     ) abc
group by ip
having sum(which) = 6;

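Admittedly, sum(which) = 6 is just a way of saying that all three tables are present (1 + 2 + 3; no other subset of the tags adds up to 6). Because of the select distinct in each subquery, every table contributes at most one row per IP, so you could use this instead: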

having count(*) = 3
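
To make the two points above concrete, here is a minimal sketch that runs both queries on a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption for illustration only; the toy IPs and the names `TOY`, `make_conn`, and `scalar` are hypothetical and not part of the original answer. The outer COUNT(*) wrapper is also an addition, included because the question asks for a count rather than the list of IPs.

```python
import sqlite3

# Hypothetical toy data; SQLite stands in for Hive purely for illustration.
TOY = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.5"],
}


def make_conn(rows_by_table):
    """Load each list of IPs into an in-memory table named after its key."""
    conn = sqlite3.connect(":memory:")
    for name, ips in rows_by_table.items():
        conn.execute(f"CREATE TABLE {name} (ip TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(ip,) for ip in ips])
    return conn


# Number of rows the 3-way join materializes before any DISTINCT is applied.
JOIN_ROWS = """
SELECT COUNT(*) FROM a JOIN b ON a.ip = b.ip JOIN c ON a.ip = c.ip
"""

# The union-all / group-by query, wrapped in COUNT(*) to return a single number.
COMMON_IPS = """
SELECT COUNT(*)
FROM (SELECT ip
      FROM (SELECT DISTINCT ip, 1 AS which FROM a UNION ALL
            SELECT DISTINCT ip, 2 AS which FROM b UNION ALL
            SELECT DISTINCT ip, 3 AS which FROM c) abc
      GROUP BY ip
      HAVING {condition}) common
"""


def scalar(conn, sql):
    """Run a query that returns one row with one column and return that value."""
    return conn.execute(sql).fetchone()[0]


if __name__ == "__main__":
    conn = make_conn(TOY)
    print("rows produced by the 3-way join:", scalar(conn, JOIN_ROWS))
    for condition in ("SUM(which) = 6", "COUNT(*) = 3"):
        print(condition, "->", scalar(conn, COMMON_IPS.format(condition=condition)))
```

On this toy data the join materializes 6 rows to find only 2 common IPs, and both HAVING conditions agree on the count. The gap between joined rows and distinct IPs grows with the number of duplicates per IP, which is why deduplicating before combining the tables is attractive at the scale described in the question.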

Answer 1: (score: 1)

A simple solution:

select      count(*)

from       (select      1

            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t

            group by    ip

            having      count(case when tab = 'a' then 1 end) > 0
                    and count(case when tab = 'b' then 1 end) > 0
                    and count(case when tab = 'c' then 1 end) > 0

            ) t

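The following query gives you information not only about the 3-way intersection (in_a = 1, in_b = 1, in_c = 1) but also about every other combination. A flag is NULL rather than 0 when the IP is absent from that table, because the case expressions have no else branch: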

select      in_a
           ,in_b
           ,in_c
           ,count(*)    as ips

from       (select      max(case when tab = 'a' then 1 end)  as in_a
                       ,max(case when tab = 'b' then 1 end)  as in_b
                       ,max(case when tab = 'c' then 1 end)  as in_c

            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t

            group by    ip
            ) t

group by    in_a
           ,in_b
           ,in_c
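
As a rough illustration of what this breakdown returns, the sketch below runs the same query shape on a handful of made-up rows. SQLite (via Python's sqlite3 module) is used only as a convenient stand-in for Hive; the toy IPs and the names `TOY`, `make_conn`, and `breakdown` are hypothetical.

```python
import sqlite3

# Hypothetical toy data: .1 is in all three tables, .2 in a and b,
# .3 in a and c, .4 only in b, .5 only in c.
TOY = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.1", "10.0.0.3", "10.0.0.5"],
}


def make_conn(rows_by_table):
    """Load each list of IPs into an in-memory table named after its key."""
    conn = sqlite3.connect(":memory:")
    for name, ips in rows_by_table.items():
        conn.execute(f"CREATE TABLE {name} (ip TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(ip,) for ip in ips])
    return conn


BREAKDOWN = """
SELECT in_a, in_b, in_c, COUNT(*) AS ips
FROM (SELECT MAX(CASE WHEN tab = 'a' THEN 1 END) AS in_a,
             MAX(CASE WHEN tab = 'b' THEN 1 END) AS in_b,
             MAX(CASE WHEN tab = 'c' THEN 1 END) AS in_c
      FROM (SELECT 'a' AS tab, ip FROM a
            UNION ALL SELECT 'b' AS tab, ip FROM b
            UNION ALL SELECT 'c' AS tab, ip FROM c) t
      GROUP BY ip) t
GROUP BY in_a, in_b, in_c
"""


def breakdown(conn):
    """Map each (in_a, in_b, in_c) combination to its number of distinct IPs."""
    return {(a, b, c): n for a, b, c, n in conn.execute(BREAKDOWN)}


if __name__ == "__main__":
    for flags, ips in sorted(breakdown(make_conn(TOY)).items(), key=repr):
        # None corresponds to SQL NULL: the IP never appears in that table.
        print(flags, ips)
```

Each key is one membership pattern, with `None` (SQL NULL) marking a table in which the IP never appears. The row keyed `(1, 1, 1)` is the 3-way intersection the question asks about.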

...and even more information:

select      sign(cnt_a)                 as in_a
           ,sign(cnt_b)                 as in_b
           ,sign(cnt_c)                 as in_c

           ,count(*)                    as unique_ips
           ,sum(cnt_total)              as total_ips
           ,sum(cnt_a)                  as total_ips_in_a
           ,sum(cnt_b)                  as total_ips_in_b
           ,sum(cnt_c)                  as total_ips_in_c

from       (select      count(*)                                as cnt_total
                       ,count(case when tab = 'a' then 1 end)   as cnt_a
                       ,count(case when tab = 'b' then 1 end)   as cnt_b
                       ,count(case when tab = 'c' then 1 end)   as cnt_c

            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t

            group by    ip
            ) t

group by    sign(cnt_a)
           ,sign(cnt_b)
           ,sign(cnt_c)
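
The sketch below shows how the columns of this last query might be read, again on made-up rows with SQLite standing in for Hive. Because not every SQLite build ships a `sign` function, the example registers a small Python helper under that name; that registration, the toy IPs, and the names `TOY`, `make_conn`, and `detailed_breakdown` are assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy data: .1 is in all three tables, .2 in a and b,
# .3 in a and c, .4 only in b, .5 only in c.
TOY = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.1", "10.0.0.3", "10.0.0.5"],
}


def _sign(x):
    """Return -1, 0 or 1, mirroring what sign() does in the Hive query."""
    return (x > 0) - (x < 0)


def make_conn(rows_by_table):
    """Load the toy tables and register sign() so the query runs unchanged."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("sign", 1, _sign)
    for name, ips in rows_by_table.items():
        conn.execute(f"CREATE TABLE {name} (ip TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(ip,) for ip in ips])
    return conn


DETAILED = """
SELECT sign(cnt_a) AS in_a, sign(cnt_b) AS in_b, sign(cnt_c) AS in_c,
       COUNT(*) AS unique_ips, SUM(cnt_total) AS total_ips,
       SUM(cnt_a) AS total_ips_in_a,
       SUM(cnt_b) AS total_ips_in_b,
       SUM(cnt_c) AS total_ips_in_c
FROM (SELECT COUNT(*) AS cnt_total,
             COUNT(CASE WHEN tab = 'a' THEN 1 END) AS cnt_a,
             COUNT(CASE WHEN tab = 'b' THEN 1 END) AS cnt_b,
             COUNT(CASE WHEN tab = 'c' THEN 1 END) AS cnt_c
      FROM (SELECT 'a' AS tab, ip FROM a
            UNION ALL SELECT 'b' AS tab, ip FROM b
            UNION ALL SELECT 'c' AS tab, ip FROM c) t
      GROUP BY ip) t
GROUP BY sign(cnt_a), sign(cnt_b), sign(cnt_c)
"""


def detailed_breakdown(conn):
    """Map (in_a, in_b, in_c) to (unique_ips, total_ips, in_a, in_b, in_c totals)."""
    return {tuple(row[:3]): tuple(row[3:]) for row in conn.execute(DETAILED)}


if __name__ == "__main__":
    for flags, stats in sorted(detailed_breakdown(make_conn(TOY)).items()):
        print(flags, stats)
```

Here the flags are 0 or 1 rather than NULL or 1, since `count` returns 0 for a table in which the IP is absent. For the `(1, 1, 1)` group, `unique_ips` is the answer to the original question, while the `total_*` columns report how many raw (duplicate-inclusive) rows those IPs account for in each table.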