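I am trying to write a Hive query that returns a.user_id, b.user_id, and the number of movies both users like. When I run the query, I get a row with a.user_id, b.user_id, the count, and the set of movies. I also get a mirrored row with b.user_id, a.user_id, the same count, and the same set of movies.
My question is how to restrict the query so that each pair of users is returned only once, as a.user_id, b.user_id, the count, and the set of movies.
Resolved by changing line 4 of the query to ON (a.movie_id = b.movie_id AND a.user_id < b.user_id)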
SELECT a.user_id, b.user_id, count(*) AS num, collect_set(m.movie_title)
FROM ratings a
JOIN ratings b
ON (a.movie_id = b.movie_id)
JOIN movies m
ON (a.movie_id = m.movie_id AND b.movie_id = m.movie_id)
WHERE (a.user_id <> b.user_id)
GROUP BY a.user_id, b.user_id
ORDER BY num DESC;
Current output:
A,B,25,list of movies
B,A,25,list of movies
Desired output:
A,B,25,list of movies
Answer 0 (score: 1)
I think this is the query you want:
SELECT a.user_id, b.user_id, count(*) AS num, collect_set(m.movie_title)
FROM ratings a JOIN
ratings b
ON a.movie_id = b.movie_id JOIN
movies m
ON a.movie_id = m.movie_id
WHERE a.user_id < b.user_id
GROUP BY a.user_id, b.user_id
ORDER BY num DESC;
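The reason this works: the self-join produces every pair of users twice, once as (A, B) and once as (B, A). The condition a.user_id <> b.user_id only removes rows where a user is paired with themselves, while a.user_id < b.user_id also keeps just one ordering of each pair. Below is a minimal sketch that reproduces the idea in Python with an in-memory SQLite database rather than Hive. The sample data, the function name common_pairs, and the use of group_concat(DISTINCT ...) in place of Hive's collect_set are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample data: (user_id, movie_id) and (movie_id, movie_title).
RATINGS = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)]
MOVIES = [(1, "Alien"), (2, "Brazil")]


def common_pairs(ratings, movies):
    """Return one row per unordered user pair with their shared movies."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (user_id TEXT, movie_id INTEGER)")
    conn.execute("CREATE TABLE movies (movie_id INTEGER, movie_title TEXT)")
    conn.executemany("INSERT INTO ratings VALUES (?, ?)", ratings)
    conn.executemany("INSERT INTO movies VALUES (?, ?)", movies)
    # Same shape as the Hive query; group_concat(DISTINCT ...) is SQLite's
    # closest equivalent to collect_set and returns a comma-separated string.
    query = """
        SELECT a.user_id, b.user_id, COUNT(*) AS num,
               group_concat(DISTINCT m.movie_title)
        FROM ratings a
        JOIN ratings b ON a.movie_id = b.movie_id
        JOIN movies m ON a.movie_id = m.movie_id
        WHERE a.user_id < b.user_id
        GROUP BY a.user_id, b.user_id
        ORDER BY num DESC
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


pairs = common_pairs(RATINGS, MOVIES)
for row in pairs:
    print(row)
```

With this data each pair of users shows up once, with the smaller user_id in the first column. Swapping the WHERE condition back to a.user_id <> b.user_id would return every pair twice, which is the behavior described in the question.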