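The basic premise of the question is that I have some huge tables in Hadoop and I need to take some samples from each month. I've mocked up the example below to show the kind of thing I'm after, but obviously it isn't the real data...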
--Create the table
CREATE TABLE exp_dqss_team.testranking (
Name STRING,
Age INT,
Favourite_Cheese STRING
) STORED AS PARQUET;
--Put some data in
INSERT INTO TABLE exp_dqss_team.testranking
VALUES
('Tim', 33, 'Cheddar'),
('Martin', 49, 'Gorgonzola'),
('Will', 39, 'Brie'),
('Bob', 63, 'Cheddar'),
('Bill', 35, 'Brie'),
('Ben', 42, 'Gorgonzola'),
('Duncan', 55, 'Brie'),
('Dudley', 28, 'Cheddar'),
('Edmund', 27, 'Brie'),
('Baldrick', 29, 'Gorgonzola');
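What I want is the 2 youngest people in each cheese category. The query below gives the age rank within each cheese category, but it doesn't limit the result to the top two: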
SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking;
If I add a WHERE clause, I get the following error:
WHERE clause must not contain analytic expressions
SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking
WHERE RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) <3;
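Is there a better way to do this than creating a table containing all the rankings and then selecting from it with a WHERE clause on the rank?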
Answer 0 (score: 1)
Can you try this?
select * from (
SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking
) as temp
where rank_my_cheese <= 2;
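The subquery works because the analytic function is computed in the inner query, so the outer WHERE only sees an ordinary column. One thing to watch: RANK() gives tied ages the same rank, so a cheese can return more than two rows. Below is a minimal sketch of a variant that caps the result at exactly two rows per cheese, assuming your engine supports WITH and ROW_NUMBER(). The names ranked and row_in_cheese are made up for illustration, and breaking ties by name is an arbitrary choice.

```sql
-- Sketch: keep at most two rows per cheese, even when ages tie.
-- "ranked" and "row_in_cheese" are illustrative names, not from the original tables.
WITH ranked AS (
  SELECT
    favourite_cheese,
    name,
    age,
    -- ROW_NUMBER() never repeats within a partition, unlike RANK();
    -- the secondary sort on name is an assumed tie-breaker.
    ROW_NUMBER() OVER (
      PARTITION BY favourite_cheese
      ORDER BY age ASC, name ASC
    ) AS row_in_cheese
  FROM exp_dqss_team.testranking
)
SELECT favourite_cheese, name, age
FROM ranked
WHERE row_in_cheese <= 2;
```

With the sample rows from the question this should keep Edmund and Bill for Brie, Dudley and Tim for Cheddar, and Baldrick and Ben for Gorgonzola.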