Spark SQL query: group by a value and return the list of grouped items

Time: 2018-11-23 11:01:45

Tags: apache-spark apache-spark-sql

I have a table (Data) like this:

color   status   freq
red     y        1
blue    y        1
green   y        2

Expected output:

red,blue   1
green      2

select color , freq from  data where status = 'y' group by(freq)

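I want red and blue returned together for freq = 1, and green for freq = 2. How can I get the list of colors grouped by freq? Please correct the SQL query above.

When I use first(colour), it returns only the first color of each group, but I expect all colors grouped by freq.

Please correct the SQL query so that it produces the expected output.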

1 answer:

Answer 0 (score: 0)

Try this:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// Sample data: colour, status flag, and frequency.
val df = Seq(
  ("green", "y", 4),
  ("blue", "n", 7),
  ("red", "y", 7),
  ("yellow", "y", 7),
  ("cyan", "y", 7)
).toDF("colour", "status", "freq")

// Keep only rows with status 'y', then collect every colour for each freq.
val df2 = df.where("status = 'y'")
  .groupBy("freq")
  .agg(collect_list($"colour"))

df2.show(false)

Returns:

+----+--------------------+
|freq|collect_list(colour)|
+----+--------------------+
|4   |[green]             |
|7   |[red, yellow, cyan] |
+----+--------------------+
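
Since the question asks for a SQL query, the same idea can probably be written directly in Spark SQL. The query in the question selects color without aggregating it, which Spark SQL rejects when grouping by freq. The sketch below is one possible correction, not the only one: it wraps color in collect_list and joins the result with concat_ws to get a comma-separated string such as red,blue. It assumes the question's table is registered as a temporary view named data; the names colors, result, and the app name are made up for illustration. It is written as top-level statements, so it should run in spark-shell or as a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-colors-by-freq") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// The table from the question, registered as a temporary view named "data".
Seq(("red", "y", 1), ("blue", "y", 1), ("green", "y", 2))
  .toDF("color", "status", "freq")
  .createOrReplaceTempView("data")

// collect_list gathers every color in a freq group into an array;
// concat_ws joins that array into one comma-separated string.
val result = spark.sql(
  """SELECT concat_ws(',', collect_list(color)) AS colors, freq
    |FROM data
    |WHERE status = 'y'
    |GROUP BY freq
    |ORDER BY freq""".stripMargin)

result.show(false)
```

Note that collect_list does not guarantee the order of elements within a group, so the first row may read red,blue or blue,red. If duplicates should be dropped, collect_set can be used in place of collect_list.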