I'm wondering whether there is some way to specify a custom aggregation function for Spark DataFrames. If I have a table with two columns, id and value, I'd like to group by id and aggregate the values into a list per id, like this:
From:
john | tomato
john | carrot
bill | apple
john | banana
bill | taco
To:
john | tomato, carrot, banana
bill | apple, taco
Is this possible with DataFrames? I'm asking about DataFrames because I'm reading the data from an ORC file and loading it as a DataFrame. I figure converting it to an RDD would be inefficient.
Answer 0 (score: 7)
I would simply use the following:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // for toDF and the $ column syntax; assumes a SparkSession named spark (already in scope in spark-shell)
val df = Seq(("john", "tomato"), ("john", "carrot"),
             ("bill", "apple"), ("john", "banana"),
             ("bill", "taco")).toDF("id", "value")
// df: org.apache.spark.sql.DataFrame = [id: string, value: string]
val aggDf = df.groupBy($"id").agg(collect_list($"value").as("values"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, values: array<string>]
aggDf.show(false)
// +----+------------------------+
// |id |values |
// +----+------------------------+
// |john|[tomato, carrot, banana]|
// |bill|[apple, taco] |
// +----+------------------------+
You don't even need to drop down to the underlying rdd.
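If you want the comma-separated strings shown in the question rather than an array column, one small variation is to wrap collect_list in concat_ws (a sketch building on the same df as above; concat_ws also comes from org.apache.spark.sql.functions, and the expected output below assumes the same element ordering as shown above):
import org.apache.spark.sql.functions.{collect_list, concat_ws}
// Join the collected values into a single comma-separated string per id.
val joinedDf = df.groupBy($"id").agg(concat_ws(", ", collect_list($"value")).as("values"))
joinedDf.show(false)
// +----+----------------------+
// |id  |values                |
// +----+----------------------+
// |john|tomato, carrot, banana|
// |bill|apple, taco           |
// +----+----------------------+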
Answer 1 (score: 2)
Reverting to RDD operations tends to handle problems like this best:
scala> val df = sc.parallelize(Seq(("john", "tomato"),
                   ("john", "carrot"), ("bill", "apple"),
                   ("john", "banana"), ("bill", "taco")))
                   .toDF("name", "food")
df: org.apache.spark.sql.DataFrame = [name: string, food: string]
scala> df.show
+----+------+
|name|  food|
+----+------+
|john|tomato|
|john|carrot|
|bill| apple|
|john|banana|
|bill|  taco|
+----+------+
scala> import org.apache.spark.sql.Row  // needed for the Row pattern match below
import org.apache.spark.sql.Row

scala> val aggregated = df.rdd
         .map{ case Row(k: String, v: String) => (k, List(v)) }
         .reduceByKey{_ ++ _}
         .toDF("name", "foods")
aggregated: org.apache.spark.sql.DataFrame = [name: string, foods: array<string>]
scala> aggregated.collect.foreach{println}
[john,WrappedArray(tomato, carrot, banana)]
[bill,WrappedArray(apple, taco)]
As for efficiency, I believe DataFrames are backed by RDDs, so a conversion like .rdd costs very little.
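A minimal sketch of that point, assuming the df defined above is still in scope: .rdd merely exposes the DataFrame's rows as an RDD[Row], and like other transformations it does not run a job by itself.
val rows = df.rdd           // org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]; lazy, no job runs here
println(rows.toDebugString) // prints the RDD lineage; nothing executes until an action is called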