How to find the number of unique connections using Hive/Pig

Date: 2015-06-10 00:42:09

Tags: hadoop hive apache-pig

I have a sample table that looks like this:

caller   receiver 
100         200
100         300
400         100
100         200

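I need to find the number of unique connections for each number, regardless of who placed the call. For example, 100 has connections to 200, 300, and 400.

My output should be: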

100      3  
200      1  
300      1  
400      1

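I am trying to do this with Hive. If Hive cannot do it, can Pig?

2 Answers:

Answer 0 (score: 3)

The following does what you need (I am not entirely convinced it is optimal, but I will leave the optimization to you). You will need the Brickhouse jar, which is straightforward to build.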

Query:

add jar ./brickhouse-0.7.1.jar; -- name and path of yours will be different
create temporary function combine_unique as 'brickhouse.udf.collect.CombineUniqueUDAF';

select connection
  , size(combine_unique(arr)) c
from (
  select connection, arr
  from (
    select caller as connection
      , collect_set(receiver) arr
    from some_table
    group by caller ) x
  union all
  select connection, arr
  from (
    select receiver as connection
      , collect_set(caller) arr
    from some_table
    group by receiver ) y ) f
group by connection

Output:

connection    c
100           3
200           1
300           1
400           1
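
To make the logic of this query easier to follow, here is a minimal Python sketch of the same idea, using the sample rows from the question: collect each number's peers from both the caller side and the receiver side, merge them into one set, and take its size. The function name `unique_connections` is illustrative only and is not part of Hive or Brickhouse.

```python
from collections import defaultdict

def unique_connections(rows):
    """Count distinct peers per number, treating each call as undirected."""
    peers = defaultdict(set)
    for caller, receiver in rows:
        peers[caller].add(receiver)   # like collect_set(receiver) grouped by caller
        peers[receiver].add(caller)   # like collect_set(caller) grouped by receiver
    # Merging both sides into one set per number mirrors combine_unique + size
    return {number: len(s) for number, s in peers.items()}

# Sample rows from the question (the duplicate 100 -> 200 call is included)
rows = [(100, 200), (100, 300), (400, 100), (100, 200)]
counts = unique_connections(rows)
for number in sorted(counts):
    print(number, counts[number])
```

Because sets discard duplicates, the repeated 100/200 row does not inflate the count.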

Answer 1 (score: 1)

This will solve your problem.

select q1.caller, count(distinct q1.receiver)
from (
  select caller, receiver from test_1 group by caller, receiver
  union all
  select receiver as caller, caller as receiver from test_1 group by receiver, caller
) q1
group by q1.caller;
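
As a quick sanity check, the same query shape can be run against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here (an assumption made for easy local testing; Hive may differ in details), and the table `test_1` is filled with the sample rows from the question.

```python
import sqlite3

# In-memory database standing in for the Hive table (assumption: test only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_1 (caller INTEGER, receiver INTEGER)")
conn.executemany(
    "INSERT INTO test_1 VALUES (?, ?)",
    [(100, 200), (100, 300), (400, 100), (100, 200)],
)

# Mirror every row so each call is seen from both ends, then count distinct peers
query = """
SELECT q1.caller, COUNT(DISTINCT q1.receiver)
FROM (
  SELECT caller, receiver FROM test_1
  UNION ALL
  SELECT receiver AS caller, caller AS receiver FROM test_1
) q1
GROUP BY q1.caller
ORDER BY q1.caller
"""
result = conn.execute(query).fetchall()
print(result)
```

The inner `group by` clauses from the answer are omitted in this sketch because `COUNT(DISTINCT ...)` already ignores duplicate pairs; in Hive they may still be useful for shrinking the data before the union.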