Select the top 10 items for each city in SparkSQL

Time: 2016-05-26 15:32:26

Tags: sql apache-spark

I have the following SQL table (SparkSQL):

user_id, city, timestamp, item_id

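For each date, I need to find the top 10 items in each city, ranked by the number of times each item_id appears in that city on that date.

I tried the following: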

SELECT   * 
FROM     ( 
                SELECT *, 
                       row_number() OVER (PARTITION BY city) AS rn 
                FROM   mytable) AS foo 
ORDER BY rn DESC

However, although this sorts by rn, it does not give me the top 10 items for a given date. What is the correct way to solve this? Thanks!

1 answer:

Answer 0: (score: 2)

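I'm not sure which Spark function truncates the time portion of a timestamp; the query below assumes to_date does this.

Either way, you first need to compute the count of each item per city and date, and only then apply row_number, ordered by that count in descending order: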

SELECT *
FROM (
        SELECT   city, item_id, theDATE, cnt,
                 -- rank items within each city and date, most frequent first
                 ROW_NUMBER() OVER (PARTITION BY city, theDATE
                                    ORDER BY cnt DESC) rn
        FROM     (SELECT   city,
                           item_id,
                           to_date(timestamp) AS theDATE, -- remove time and leave just date.
                           COUNT(*) AS cnt                -- occurrences of this item in this city on this date
                  FROM     mytable
                  GROUP BY city, item_id, to_date(timestamp)
                 ) AS foo 
     ) AS boo
WHERE rn <= 10
ORDER BY city, theDATE, rn
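
To sanity-check the "count first, then rank" logic on a small sample before running it on a cluster, the same two steps can be sketched in plain Python. This is only an illustration, not Spark code: the function name `top_items_per_city_day`, the sample rows, and the `YYYY-MM-DD HH:MM:SS` timestamp format are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def top_items_per_city_day(rows, n=10):
    """Return {(city, date): [(item_id, count), ...]} with at most n items each.

    Hypothetical helper mirroring the SQL: GROUP BY + COUNT, then rank and cut.
    """
    # Step 1: count occurrences of each item per (city, calendar date).
    counts = defaultdict(Counter)
    for user_id, city, timestamp, item_id in rows:
        # Assumed timestamp format; drop the time part, like to_date().
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        counts[(city, day)][item_id] += 1

    # Step 2: order by descending count and keep the first n (like rn <= n).
    return {key: counter.most_common(n) for key, counter in counts.items()}


# Made-up sample rows: (user_id, city, timestamp, item_id)
rows = [
    (1, "Paris", "2016-05-26 09:00:00", "a"),
    (2, "Paris", "2016-05-26 10:30:00", "a"),
    (3, "Paris", "2016-05-26 11:00:00", "b"),
    (4, "Paris", "2016-05-27 08:00:00", "b"),
    (5, "Rome", "2016-05-26 12:00:00", "c"),
]

result = top_items_per_city_day(rows, n=1)
for (city, day), items in sorted(result.items()):
    print(city, day, items)
```

As with ROW_NUMBER, ties between items with equal counts are broken arbitrarily here (by first appearance); if tied items should all be kept, RANK or DENSE_RANK in the SQL version would be the closer match.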