I have the following data:
CustomerId Category
100 2
100 2
100 3
100 6
100 4
200 3
200 6
200 7
300 2
The output I want is the Jaccard similarity index for each pair of customers.
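My first attempt was to compute the union and the intersection, but I'm not sure it is the most efficient way. I would also like to avoid duplicates such as Jaccard(100,300) and Jaccard(300,100) both appearing. Can anyone help?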
select t1.customer_id, t2.customer_id,
       sum(case when t1.category_id = t2.category_id then 1 else 0 end) as intersection,
       sum(case when t1.category_id = t2.category_id then 1
                when t1.category_id <> t2.category_id then 1 else 0 end) as union_count
from t t1 cross join
     t t2
where t1.customer_id <> t2.customer_id
group by t1.customer_id, t2.customer_id
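Unfortunately, I also checked and found that a customer can buy several items in the same category. I have therefore edited the table to show that customer 100 has two items in category 2. This should not change the Jaccard similarity measure, however.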
Answer 0 (score: 1)
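You don't need a cross join. To get the denominator, add the number of distinct category_id values of each customer in the pair and subtract the number of category_id values they share.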
SELECT t1.customer_id AS id1,
t2.customer_id AS id2,
1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
/ (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
JOIN t t2 ON t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id
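The denominator in the query relies on the identity |A ∪ B| = |A| + |B| - |A ∩ B|. As a quick sanity check outside the database, here is a minimal Python sketch using the sample data from the question. The `categories` dict and the `jaccard` helper are illustrative names only and are not part of the query.

```python
from itertools import combinations

# Sample data from the question, with each customer's categories as a set
# (a set drops the duplicate category 2 for customer 100).
categories = {
    100: {2, 3, 6, 4},
    200: {3, 6, 7},
    300: {2},
}


def jaccard(a, b):
    """Jaccard index computed as |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    inter = len(a & b)
    union = len(a) + len(b) - inter  # inclusion-exclusion, equals len(a | b)
    return inter / union


# combinations() yields each unordered pair once, mirroring id1 < id2 in the SQL.
results = {
    (id1, id2): jaccard(categories[id1], categories[id2])
    for id1, id2 in combinations(sorted(categories), 2)
}

for pair, value in results.items():
    print(pair, value)
```

For this data the pairs come out as 0.4 for (100, 200), 0.25 for (100, 300) and 0.0 for (200, 300), which are the values a correct SQL query should reproduce.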
If your database does not support inequality conditions in a join, use:
SELECT t1.customer_id AS id1,
t2.customer_id AS id2,
1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
/ (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id
If you only need the intersection count for each pair, the following query is enough.
select t1.customer_id as id1, t2.customer_id as id2
,sum(case when t1.category_id = t2.category_id then 1 else 0 end) as intersection
from t t1
join t t2 on t1.customer_id<t2.customer_id
group by t1.customer_id, t2.customer_id
Edit: Per the OP's comment, a customer can have the same category more than once, but it should be counted only once.
SELECT t1.customer_id AS id1,
t2.customer_id AS id2,
1.0*COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)
/ (COUNT(DISTINCT t1.category_id)+COUNT(DISTINCT t2.category_id)
-COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id
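To see the effect of the duplicate row, the sketch below loads the question's sample data into an in-memory SQLite database and runs both the SUM-based query and the COUNT(DISTINCT ...) query. The table name t and the columns customer_id and category_id follow the queries above. Using SQLite through Python's sqlite3 module is an assumption made purely for illustration, and other databases may differ in details such as integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id INTEGER, category_id INTEGER)")
# Sample rows from the question; customer 100 has category 2 twice.
rows = [(100, 2), (100, 2), (100, 3), (100, 6), (100, 4),
        (200, 3), (200, 6), (200, 7), (300, 2)]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Earlier version: SUM counts every matching row, so duplicates inflate it.
SUM_QUERY = """
SELECT t1.customer_id AS id1, t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
       / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)
          -sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END))
FROM t t1 CROSS JOIN t t2
WHERE t1.customer_id < t2.customer_id
GROUP BY t1.customer_id, t2.customer_id
ORDER BY id1, id2
"""

# Edited version: COUNT(DISTINCT ...) counts each shared category once.
DISTINCT_QUERY = """
SELECT t1.customer_id AS id1, t2.customer_id AS id2,
       1.0*COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)
       / (COUNT(DISTINCT t1.category_id)+COUNT(DISTINCT t2.category_id)
          -COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END))
FROM t t1 CROSS JOIN t t2
WHERE t1.customer_id < t2.customer_id
GROUP BY t1.customer_id, t2.customer_id
ORDER BY id1, id2
"""


def run(query):
    """Return {(id1, id2): jaccard_similarity} for the given query."""
    return {(id1, id2): value for id1, id2, value in conn.execute(query)}


sum_result = run(SUM_QUERY)
distinct_result = run(DISTINCT_QUERY)

for pair in distinct_result:
    print(pair, "sum-based:", sum_result[pair], "distinct:", distinct_result[pair])
```

With this data the two queries agree on (100, 200) and (200, 300), but for (100, 300) the SUM-based version counts category 2 twice and returns 2/3, while the DISTINCT version returns the expected 0.25.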