Suppose I have the following Spark SQL DataFrame (of type org.apache.spark.sql.DataFrame):

type   individual
=================
cat    fritz
cat    felix
mouse  mickey
mouse  minnie
rabbit bugs
duck   donald
duck   daffy
cat    sylvester

I want to convert it into a DataFrame that looks like this:

type   individuals
================================
cat    [fritz, felix, sylvester]
mouse  [mickey, minnie]
rabbit [bugs]
duck   [donald, daffy]

I know I have to do something like myDataFrame.groupBy("type").agg(???). What is "???"? Is it that simple, or is it more complicated?
Answer 0 (score: 1)

You can aggregate with collect_list, as follows:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = Seq(
  ("cat", "fritz"),
  ("cat", "felix"),
  ("mouse", "mickey"),
  ("mouse", "minnie"),
  ("rabbit", "bugs"),
  ("duck", "donald"),
  ("duck", "daffy"),
  ("cat", "sylvester")
).toDF("type", "individual")

// Aggregate the individuals of each group into an array
val groupedDF = df.groupBy($"type").agg(collect_list($"individual").as("individuals"))

groupedDF.show(truncate = false)
+------+-------------------------+
|type |individuals |
+------+-------------------------+
|cat |[fritz, felix, sylvester]|
|duck |[donald, daffy] |
|rabbit|[bugs] |
|mouse |[mickey, minnie] |
+------+-------------------------+
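If you also want to drop duplicate individuals, or want the arrays to come back in a deterministic order (collect_list gives no ordering guarantee across partitions), collect_set and sort_array from org.apache.spark.sql.functions can be combined with the same groupBy. A minimal sketch, reusing the df defined above:

```scala
import org.apache.spark.sql.functions.{collect_set, sort_array}

// collect_set drops duplicate individuals within each group;
// sort_array then makes the element order deterministic
val dedupedDF = df
  .groupBy($"type")
  .agg(sort_array(collect_set($"individual")).as("individuals"))

dedupedDF.show(truncate = false)
```

This requires a running SparkSession; the output has the same shape as above, with each individuals array sorted alphabetically.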
Answer 1 (score: 0)

If you don't mind using a bit of HQL, you can use the collect_list aggregate function: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions%28UDAF%29

For example: spark.sql("select type, collect_list(individual) as individuals from myDf group by type") (note that sql lives on the SQL entry point, not on sparkContext, and that the DataFrame must be registered as a temp view named myDf first).

Not sure whether you can access it directly in Spark.
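To address the doubt above: collect_list is available directly in Spark SQL (since 1.6), no Hive installation needed. A minimal sketch wiring the SQL route end to end, assuming a Spark 2.x SparkSession named spark and the df from the first answer:

```scala
// Register the DataFrame under a name that SQL queries can refer to
df.createOrReplaceTempView("myDf")

// collect_list works in plain Spark SQL, without a Hive metastore
val sqlGroupedDF = spark.sql(
  "select type, collect_list(individual) as individuals from myDf group by type"
)

sqlGroupedDF.show(truncate = false)
```

This produces the same result as the DataFrame-API version with groupBy and agg.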