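I have the following data for all transactions, in which each customer has purchased items from multiple categories. I need to find the pairs of customers who do not share even a single category.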
Customer_id category_id
21 3
21 5
31 4
31 1
24 3
24 6
22 6
22 5
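I first tried using collect_set and then comparing the sets in a cross join, but I don't know of any Hive function for comparing them. Is there a simpler way to do this? For the data above, the output should be (21,31), (31,24), (31,22): these are the pairs that do not share any category_id.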
SELECT
customer_id, COLLECT_LIST(category_id) AS aggr_set
FROM
tablename
GROUP BY
customer_id
Answer 0: (score: 0)
You can use a cross join and then aggregate:
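select t1.customer_id, t2.customer_id
from t t1 cross join
     t t2
-- keep each unordered pair once instead of both (a, b) and (b, a)
where t1.customer_id < t2.customer_id
group by t1.customer_id, t2.customer_id
-- a pair qualifies only if no row combination has the same category_id
having sum(case when t1.category_id = t2.category_id then 1 else 0 end) = 0;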
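To see why the aggregate condition works, the sketch below computes the number of matching category combinations per pair on the sample data from the question. The CTE named t only stands in for the real table, and the c1 < c2 filter and the column names c1, c2, and shared are additions for readability, not part of the answer above. It assumes a Hive version that allows SELECT without FROM and UNION ALL inside a CTE.

```sql
-- Sample rows from the question, inlined as a CTE for illustration only.
with t as (
  select 21 as customer_id, 3 as category_id union all
  select 21, 5 union all
  select 31, 4 union all
  select 31, 1 union all
  select 24, 3 union all
  select 24, 6 union all
  select 22, 6 union all
  select 22, 5
)
-- One row per customer pair with the count of shared category combinations.
select t1.customer_id as c1,
       t2.customer_id as c2,
       sum(case when t1.category_id = t2.category_id then 1 else 0 end) as shared
from t t1 cross join
     t t2
where t1.customer_id < t2.customer_id
group by t1.customer_id, t2.customer_id;
```

On this data the pairs (21,22), (21,24), and (22,24) each get shared = 1, while (21,31), (22,31), and (24,31) get shared = 0. Only the second group passes a filter of shared = 0, which matches the expected output in the question.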
Answer 1: (score: 0)
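Use a self-join to get the customer pairs, then compute the number of mismatches and the total number of row combinations for each pair. If the two are equal, every category_id combination is a mismatch, so the pair shares no category.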
select c1, c2
from (
    select t1.customer_id as c1, t2.customer_id as c2
          ,sum(case when t1.category_id = t2.category_id then 0 else 1 end) as mismatches
          ,count(*) as combinations
    from tablename t1
    join tablename t2 on t1.customer_id < t2.customer_id
    group by t1.customer_id, t2.customer_id
) t
where combinations = mismatches
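Another way to express the same idea, not taken from either answer, is an anti-join: first list the pairs that do share a category, then keep every customer pair that is missing from that list. The sketch below assumes the same tablename table; the CTE names customers and sharing and the aliases c1 and c2 are hypothetical. The sharing step uses only an equality condition in its ON clause, which may help on older Hive versions that restrict joins to equality conditions.

```sql
-- All distinct customers.
with customers as (
  select distinct customer_id
  from tablename
),
-- Pairs that share at least one category (equi-join on category_id).
sharing as (
  select distinct t1.customer_id as c1, t2.customer_id as c2
  from tablename t1
  join tablename t2 on t1.category_id = t2.category_id
  where t1.customer_id < t2.customer_id
)
-- Keep every customer pair that has no row in "sharing".
select a.customer_id as c1, b.customer_id as c2
from customers a
cross join customers b
left join sharing s
  on s.c1 = a.customer_id and s.c2 = b.customer_id
where a.customer_id < b.customer_id
  and s.c1 is null;
```

On the sample data from the question, sharing contains (21,22), (21,24), and (22,24), so the final select keeps (21,31), (22,31), and (24,31). Whether this is faster than aggregating over the full cross product depends on the data and is not something the thread establishes.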