Aggregating into a list

Asked: 2017-06-30 14:19:47

Tags: scala apache-spark-sql spark-dataframe

Suppose I have the following Spark SQL DataFrame (i.e. of type org.apache.spark.sql.DataFrame):

 type   individual
 =================
 cat    fritz
 cat    felix
 mouse  mickey
 mouse  minnie
 rabbit bugs
 duck   donald
 duck   daffy
 cat    sylvester

I want to convert it into a DataFrame like this:

 type   individuals
 ================================
 cat    [fritz, felix, sylvester]
 mouse  [mickey, minnie]
 rabbit [bugs]
 duck   [donald, daffy]

I know I have to do something like myDataFrame.groupBy("type").agg(???). What should the "???" be? Is it something that simple, or is it more complicated?

2 Answers:

Answer 0 (score: 1):

You can aggregate with collect_list as follows:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // provides toDF and the $"col" syntax (spark is the SparkSession)

val df = Seq(
  ("cat", "fritz"),
  ("cat", "felix"),
  ("mouse", "mickey"),
  ("mouse", "minnie"),
  ("rabbit", "bugs"),
  ("duck", "donald"),
  ("duck", "daffy"),
  ("cat", "sylvester")
).toDF(
  "type", "individual"
)

// Aggregate grouped individuals into arrays
val groupedDF = df.groupBy($"type").agg(collect_list($"individual").as("individuals"))

groupedDF.show(truncate=false)
+------+-------------------------+
|type  |individuals              |
+------+-------------------------+
|cat   |[fritz, felix, sylvester]|
|duck  |[donald, daffy]          |
|rabbit|[bugs]                   |
|mouse |[mickey, minnie]         |
+------+-------------------------+
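The same grouping semantics can be illustrated without Spark at all. Below is a plain-Scala sketch (illustrative names, no Spark dependency) that mirrors groupBy plus collect_list on the example data:

```scala
// Group pairs by their first element and collect the second elements
// into a list, preserving encounter order within each group.
val pairs = Seq(
  ("cat", "fritz"), ("cat", "felix"),
  ("mouse", "mickey"), ("mouse", "minnie"),
  ("rabbit", "bugs"),
  ("duck", "donald"), ("duck", "daffy"),
  ("cat", "sylvester")
)

val grouped: Map[String, List[String]] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).toList }

println(grouped("cat"))  // List(fritz, felix, sylvester)
```

Unlike the Spark version, the within-group order here is deterministic; in a distributed collect_list the order of elements in each array is not guaranteed.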

Answer 1 (score: 0):

If you don't mind using a bit of HQL, you can use the collect_list function: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions%28UDAF%29

For example: spark.sql("select type, collect_list(individual) as individuals from myDf group by type") (this assumes myDf has been registered as a temporary view; note that the sql method lives on the SparkSession, not on the SparkContext, and that the column being collected is individual, not individuals).

Not sure whether you can access it directly in Spark.