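I have 3 very large tables of IP addresses*, and I am trying to count the IPs common to all 3 tables. I have considered using joins and subqueries to find the intersection of IPs across the 3 tables. How can I find the intersection of all 3 tables with a single query?
This is not valid syntax, but it illustrates what I am trying to achieve: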
SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip
I have seen other answers on how to join 3 tables, but nothing for Hive and nothing at this scale.
*Note:
Answer 0 (score: 3)
The correct syntax is:
SELECT COUNT(DISTINCT a.ip)
FROM a JOIN
     b
     ON a.ip = b.ip JOIN
     c
     ON a.ip = c.ip;
On tables of this size, that query will probably not finish in our lifetime. A better approach is:
select ip
from (select distinct a.ip, 1 as which from a union all
      select distinct b.ip, 2 as which from b union all
      select distinct c.ip, 3 as which from c
     ) abc
group by ip
having sum(which) = 6;
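Admittedly, sum(which) = 6 only checks that all three tables are present. Because each subquery uses select distinct, every IP contributes at most one row per table, so you could simply write:
having count(*) = 3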
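As a toy-scale sanity check of the reasoning above, here is a minimal sketch that runs both having clauses on made-up data. It uses Python's built-in sqlite3 module purely as a stand-in for Hive, and the sample IPs and the helper name common_ips are assumptions for illustration, not part of the original answer.

```python
import sqlite3

# Hypothetical sample data; the table names a, b, c follow the question.
toy_tables = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.3", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.3", "10.0.0.3", "10.0.0.5"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
for name, ips in toy_tables.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])


def common_ips(having):
    """Run the union-all/group-by query with the given having clause."""
    sql = f"""
        select ip
        from (select distinct ip, 1 as which from a union all
              select distinct ip, 2 as which from b union all
              select distinct ip, 3 as which from c
             ) abc
        group by ip
        having {having}
        order by ip
    """
    return [row[0] for row in conn.execute(sql)]


by_sum = common_ips("sum(which) = 6")
by_count = common_ips("count(*) = 3")
print(by_sum, by_count, len(by_count))
```

Both clauses select the same IPs on this data because select distinct removes the duplicates inside each table first. To get the count the question asks for, wrap the query in an outer select count(*).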
Answer 1 (score: 1)
A simple solution:
select count(*)
from   (select 1
        from   (          select 'a' as tab, ip from a
                union all select 'b' as tab, ip from b
                union all select 'c' as tab, ip from c
               ) t
        group by ip
        having count(case when tab = 'a' then 1 end) > 0
           and count(case when tab = 'b' then 1 end) > 0
           and count(case when tab = 'c' then 1 end) > 0
       ) t
The following query tells you not only about the intersection of all 3 tables (in_a = 1, in_b = 1, in_c = 1) but also about every other combination:
select in_a
      ,in_b
      ,in_c
      ,count(*) as ips
from   (select max(case when tab = 'a' then 1 end) as in_a
              ,max(case when tab = 'b' then 1 end) as in_b
              ,max(case when tab = 'c' then 1 end) as in_c
        from   (          select 'a' as tab, ip from a
                union all select 'b' as tab, ip from b
                union all select 'c' as tab, ip from c
               ) t
        group by ip
       ) t
group by in_a
        ,in_b
        ,in_c
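To see what the combinations query returns, here is a small sketch on made-up data, again using Python's sqlite3 as a stand-in for Hive (the sample IPs and the per_ip alias are assumptions). Note that because the case expressions have no else branch, a table that does not contain the IP shows up as NULL (None in Python) rather than 0.

```python
import sqlite3

# Hypothetical sample data; the table names a, b, c follow the question.
toy_tables = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.3", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.3", "10.0.0.3", "10.0.0.5"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
for name, ips in toy_tables.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

sql = """
    select in_a, in_b, in_c, count(*) as ips
    from (select max(case when tab = 'a' then 1 end) as in_a
                ,max(case when tab = 'b' then 1 end) as in_b
                ,max(case when tab = 'c' then 1 end) as in_c
          from (          select 'a' as tab, ip from a
                union all select 'b' as tab, ip from b
                union all select 'c' as tab, ip from c
               ) t
          group by ip
         ) per_ip
    group by in_a, in_b, in_c
"""

# Map each (in_a, in_b, in_c) combination to its number of distinct IPs.
combos = {(in_a, in_b, in_c): ips for in_a, in_b, in_c, ips in conn.execute(sql)}
for key, ips in combos.items():
    print(key, ips)
```

The row keyed (1, 1, 1) is the three-way intersection; every other key describes IPs that appear in only some of the tables.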
... and with even more information:
select sign(cnt_a)    as in_a
      ,sign(cnt_b)    as in_b
      ,sign(cnt_c)    as in_c
      ,count(*)       as unique_ips
      ,sum(cnt_total) as total_ips
      ,sum(cnt_a)     as total_ips_in_a
      ,sum(cnt_b)     as total_ips_in_b
      ,sum(cnt_c)     as total_ips_in_c
from   (select count(*)                             as cnt_total
              ,count(case when tab = 'a' then 1 end) as cnt_a
              ,count(case when tab = 'b' then 1 end) as cnt_b
              ,count(case when tab = 'c' then 1 end) as cnt_c
        from   (          select 'a' as tab, ip from a
                union all select 'b' as tab, ip from b
                union all select 'c' as tab, ip from c
               ) t
        group by ip
       ) t
group by sign(cnt_a)
        ,sign(cnt_b)
        ,sign(cnt_c)
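The extra columns are easier to read on a concrete case. The sketch below runs the query on made-up data with Python's sqlite3 as a stand-in for Hive. The sample IPs and the per_ip alias are assumptions, and sign() is registered as a Python function here because not every SQLite build ships it.

```python
import sqlite3

# Hypothetical sample data; the table names a, b, c follow the question.
toy_tables = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.3", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.3", "10.0.0.3", "10.0.0.5"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
# Assumption: supply sign() ourselves so the sketch runs on any SQLite build.
conn.create_function("sign", 1, lambda x: (x > 0) - (x < 0))
for name, ips in toy_tables.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

sql = """
    select sign(cnt_a) as in_a, sign(cnt_b) as in_b, sign(cnt_c) as in_c
          ,count(*)       as unique_ips
          ,sum(cnt_total) as total_ips
          ,sum(cnt_a)     as total_ips_in_a
          ,sum(cnt_b)     as total_ips_in_b
          ,sum(cnt_c)     as total_ips_in_c
    from (select count(*) as cnt_total
                ,count(case when tab = 'a' then 1 end) as cnt_a
                ,count(case when tab = 'b' then 1 end) as cnt_b
                ,count(case when tab = 'c' then 1 end) as cnt_c
          from (          select 'a' as tab, ip from a
                union all select 'b' as tab, ip from b
                union all select 'c' as tab, ip from c
               ) t
          group by ip
         ) per_ip
    group by sign(cnt_a), sign(cnt_b), sign(cnt_c)
"""

# Key: (in_a, in_b, in_c); value: the five aggregate columns.
stats = {tuple(row[:3]): tuple(row[3:]) for row in conn.execute(sql)}
for key, values in stats.items():
    print(key, values)
```

Unlike the max(case ...) version, sign(count(...)) yields 0 instead of NULL for a table that lacks the IP, and the total_ips columns count duplicate rows as well as distinct addresses.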