I have a sample table that looks like this:
caller receiver
100 200
100 300
400 100
100 200
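I need to find the number of unique connections for each number, counting a connection whether the number appears as caller or as receiver. For example, 100 has connections to 200, 300 and 400.
My output should be: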
100 3
200 1
300 1
400 1
I am trying to do this in Hive. If Hive can't do it, can Pig?
Answer 0 (score: 3)
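The following does what you need (I'm not fully convinced it's optimal, but I'll leave the optimizing to you). You'll need this jar; it's very straightforward to build.
Query: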
add jar ./brickhouse-0.7.1.jar; -- name and path of yours will be different
create temporary function combine_unique as 'brickhouse.udf.collect.CombineUniqueUDAF';
select connection
, size(combine_unique(arr)) c
from (
select connection, arr
from (
select caller as connection
, collect_set(receiver) arr
from some_table
group by caller ) x
union all
select connection, arr
from (
select receiver as connection
, collect_set(caller) arr
from some_table
group by receiver ) y ) f
group by connection
Output:
connection c
100 3
200 1
300 1
400 1
Answer 1 (score: 1)
This will solve your problem.
select q1.caller, count(distinct q1.receiver)
from (
  select caller, receiver from test_1 group by caller, receiver
  union all
  select receiver as caller, caller as receiver from test_1 group by receiver, caller
) q1
group by q1.caller;
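As a sanity check on what this query computes, here is a minimal Python sketch of the same logic outside Hive: mirror every row so each call counts in both directions, then count distinct partners per number. The function name unique_connection_counts and the in-memory rows list are assumptions made for illustration only; they are not part of the original query or of any Hive API.

```python
from collections import defaultdict


def unique_connection_counts(rows):
    """Count distinct partners per number, treating each call as two-way."""
    partners = defaultdict(set)
    for caller, receiver in rows:
        # Mirror each row, like the UNION ALL of the swapped select.
        partners[caller].add(receiver)
        partners[receiver].add(caller)
    # Sets drop repeated pairs, playing the role of count(distinct ...).
    return {number: len(others) for number, others in partners.items()}


# Sample rows copied from the question's table (hypothetical in-memory form).
rows = [(100, 200), (100, 300), (400, 100), (100, 200)]
counts = unique_connection_counts(rows)
for number in sorted(counts):
    print(number, counts[number])
```

On the sample rows from the question this yields 3 for 100 and 1 each for 200, 300 and 400, matching the expected output. The duplicate row (100, 200) has no effect because each partner is stored in a set.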