Spark groupby, filter and sort: top 3 most-read articles for each city

Time: 2017-01-05 01:06:58

Tags: scala apache-spark apache-spark-sql spark-dataframe

I have table data as follows:

+-----------+--------+-------------+
| City Name |  URL   |  Read Count |
+-----------+--------+-------------+
| Gurgaon   |  URL1  |  3          |
| Gurgaon   |  URL3  |  6          |
| Gurgaon   |  URL6  |  5          |
| Gurgaon   |  URL4  |  1          |
| Gurgaon   |  URL5  |  5          |
| Delhi     |  URL3  |  4          |
| Delhi     |  URL7  |  2          |
| Delhi     |  URL5  |  1          |
| Delhi     |  URL6  |  6          |
| Punjab    |  URL6  |  5          |
| Punjab    |  URL4  |  1          |
| Mumbai    |  URL5  |  5          |
+-----------+--------+-------------+

I would like to see something like this -> the top 3 most-read articles (where they exist) for each city:

+-----------+--------+--------+
| City Name |  URL   |  Count |
+-----------+--------+--------+
| Gurgaon   |  URL3  |      6 |
| Gurgaon   |  URL6  |      5 |
| Gurgaon   |  URL5  |      5 |
| Delhi     |  URL6  |      6 |
| Delhi     |  URL3  |      4 |
| Delhi     |  URL7  |      2 |
| Punjab    |  URL6  |      5 |
| Punjab    |  URL4  |      1 |
| Mumbai    |  URL5  |      5 |
+-----------+--------+--------+

I am using Spark 2.0.2 and Scala 2.11.8.

2 answers:

Answer 0 (score: 2)

You can use a window function to get this output.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber  // renamed row_number in Spark 2.x

val df = sc.parallelize(Seq(
  ("Gurgaon","URL1",3), ("Gurgaon","URL3",6), ("Gurgaon","URL6",5), ("Gurgaon","URL4",1), ("Gurgaon","URL5",5),
  ("DELHI","URL3",4), ("DELHI","URL7",2), ("DELHI","URL5",1), ("DELHI","URL6",6), ("Mumbai","URL5",5),
  ("Punjab","URL6",6), ("Punjab","URL4",1))).toDF("City", "URL", "Count")

df.show()

+-------+----+-----+
|   City| URL|Count|
+-------+----+-----+
|Gurgaon|URL1|    3|
|Gurgaon|URL3|    6|
|Gurgaon|URL6|    5|
|Gurgaon|URL4|    1|
|Gurgaon|URL5|    5|
|  DELHI|URL3|    4|
|  DELHI|URL7|    2|
|  DELHI|URL5|    1|
|  DELHI|URL6|    6|
| Mumbai|URL5|    5|
| Punjab|URL6|    6|
| Punjab|URL4|    1|
+-------+----+-----+

// partition by city, order by descending count, and keep the top 3 rows per city
val w = Window.partitionBy($"City").orderBy($"Count".desc)
val dfTop = df.withColumn("row", rowNumber.over(w)).where($"row" <= 3).drop("row")

dfTop.show

+-------+----+-----+
|   City| URL|Count|
+-------+----+-----+
|Gurgaon|URL3|    6|
|Gurgaon|URL6|    5|
|Gurgaon|URL5|    5|
| Mumbai|URL5|    5|
|  DELHI|URL6|    6|
|  DELHI|URL3|    4|
|  DELHI|URL7|    2|
| Punjab|URL6|    6|
| Punjab|URL4|    1|
+-------+----+-----+

Output tested on Spark 1.6.2.
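Since the question targets Spark 2.0.2, where rowNumber was removed in favor of row_number, here is a minimal sketch of the same approach for Spark 2.x (assuming the df built above and a spark-shell session, so spark.implicits._ is already in scope; w2 and dfTop2 are fresh names chosen to avoid clashing with the snippet above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// same window: one partition per city, ordered by descending read count
val w2 = Window.partitionBy($"City").orderBy($"Count".desc)

// row_number numbers rows 1, 2, 3, ... within each partition,
// so the filter keeps at most three URLs per city
val dfTop2 = df.withColumn("row", row_number().over(w2))
  .where($"row" <= 3)
  .drop("row")

dfTop2.show()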

Answer 1 (score: 1)

A window function is probably the way to go, and there is a built-in function for exactly this purpose:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}

val window = Window.partitionBy($"City").orderBy(desc("Count"))
val dfTop = df.withColumn("rank", rank.over(window)).where($"rank" <= 3)
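Note the behavior on ties: rank assigns tied counts the same rank and then skips ahead (counts 6, 5, 5, 5 get ranks 1, 2, 2, 2), so a tie at the cutoff can return more than three rows per city, whereas the row_number approach in the other answer returns exactly three. A short usage sketch to display the result without the helper column (assuming the df from the first answer):

dfTop.drop("rank").show()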