Get the top n rows of each group with Spark SQL

Date: 2016-12-01 20:51:29

Tags: apache-spark apache-spark-sql

My data frame looks like this:

Tags    Place       Count

Sales   New Jersey  200
Sales   Hong Kong   200
Sales   Florida     200
Trade   New York    150
Trade   San Jose    150
Trade   New Jersey  150
Market  New Jersey  50
Market  Michigan    50
Market  Denver      50

As you can see, the table is already sorted by Count. I want to get the top n rows from each group, where a group is defined by the Tags column.

Say n is 2; then the result data frame should look like this:

Tags    Place       Count

Sales   New Jersey  200
Sales   Hong Kong   200
Trade   New York    150
Trade   San Jose    150
Market  New Jersey  50
Market  Michigan    50

How can I do this in Spark SQL?

1 Answer:

Answer 0 (score: 1)

Here is the answer.

You need to create the DataFrame using a HiveContext, since window functions in Spark 1.x require it:

import org.apache.spark.sql.hive.HiveContext

val hivecontext = new HiveContext(sc)
// `data` is an RDD[Row] and `schema` a StructType matching the table above
val df = hivecontext.createDataFrame(data, schema)
df.registerTempTable("df")

// ROW_NUMBER() requires an ORDER BY inside the OVER clause; without it
// Spark raises an AnalysisException ("window to be ordered")
hivecontext.sql("""
  SELECT tag, place, count FROM
    (SELECT tag, place, count,
            ROW_NUMBER() OVER (PARTITION BY tag ORDER BY count DESC) AS rank
     FROM df) tmp
  WHERE rank <= 2
  ORDER BY count DESC""").show(false)
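The post does not show how `data` and `schema` were built. A minimal sketch, assuming an `RDD[Row]` plus an explicit `StructType` (the column names `tag`, `place`, `count` match the query above; everything else here is an assumption):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumed construction of the sample rows; the original post does not show this
val data = sc.parallelize(Seq(
  Row("Sales", "New Jersey", 200), Row("Sales", "Hong Kong", 200),
  Row("Sales", "Florida", 200), Row("Trade", "New York", 150),
  Row("Trade", "San Jose", 150), Row("Trade", "New Jersey", 150),
  Row("Market", "New Jersey", 50), Row("Market", "Michigan", 50),
  Row("Market", "Denver", 50)))

// Schema with the column names used by the queries in this answer
val schema = StructType(Seq(
  StructField("tag", StringType),
  StructField("place", StringType),
  StructField("count", IntegerType)))
```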

Equivalently, with the DataFrame API:

import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
import hivecontext.implicits._

// row_number() also requires an ordered window specification
val w = Window.partitionBy(df("tag")).orderBy(df("count").desc)
val rank = row_number().over(w).alias("rank")
df.select($"*", rank).filter($"rank" <= 2).orderBy($"count".desc).drop("rank").show(false)

Output:

+------+----------+-----+
|tag   |place     |count|
+------+----------+-----+
|Sales |Hong Kong |200  |
|Sales |New Jersey|200  |
|Trade |New York  |150  |
|Trade |San Jose  |150  |
|Market|New Jersey|50   |
|Market|Michigan  |50   |
+------+----------+-----+
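One caveat: in the sample data all rows within a tag have the same count, so `row_number()` picks an arbitrary 2 of the tied rows. If you would rather keep every row that ties for the top n counts, `dense_rank` is the usual substitute (a sketch against the same `df` and `hivecontext` as above):

```scala
import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import hivecontext.implicits._

// dense_rank assigns the same rank to rows with equal count, so ties survive
val wOrdered = Window.partitionBy(df("tag")).orderBy(df("count").desc)
df.select($"*", dense_rank().over(wOrdered).alias("rank"))
  .filter($"rank" <= 2)
  .drop("rank")
  .show(false)
```

With this data every row ranks 1 within its tag, so all nine rows come back; the difference only shows once counts vary within a group.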