Convert a Spark DataFrame groupBy into a sequence of DataFrames

Time: 2017-07-18 04:47:40

Tags: scala apache-spark apache-spark-sql

I'm working with a Spark DataFrame (in Scala), and what I'd like to do is group by a column and turn the different groups into a sequence of DataFrames.

So it would look something like

df.groupBy("col").toSeq  -> Seq[DataFrame]

Or, even better, something keyed by the group value, like

df.groupBy("col").toSeq  -> Dict[key, DataFrame]

This seems like an obvious thing to want, but I can't figure out how to make it work.

1 Answer:

Answer 0 (score: 2)

Here's one way to do it; here is a simple example:

import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
  (29,"City 2", 72),
  (28,"City 3", 48),
  (28,"City 2", 19),
  (27,"City 2", 16),
  (28,"City 1", 84),
  (28,"City 4", 72),
  (29,"City 4", 39),
  (27,"City 3", 42),
  (26,"City 3", 68),
  (27,"City 1", 89),
  (27,"City 4", 104),
  (26,"City 2", 19),
  (29,"City 3", 27)
)).toDF("week", "city", "sale")
//create a dataframe with dummy data


//get the list of distinct cities (collected to the driver)
val city = data.select("city").distinct.collect().flatMap(_.toSeq)

//filter the rows belonging to each city
//this returns Seq[(Any, DataFrame)] as (city, DataFrame)
val result = city.map(c => c -> data.where($"city" === c))

//print all the dataframes
result.foreach { a =>
  println(s"Dataframe with ${a._1}")
  a._2.show()
}

The output looks like this:

Dataframe with City 1

+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 1|  84|
|  27|City 1|  89|
+----+------+----+

Dataframe with City 3

+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 3|  48|
|  27|City 3|  42|
|  26|City 3|  68|
|  29|City 3|  27|
+----+------+----+

Dataframe with City 4

+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 4|  72|
|  29|City 4|  39|
|  27|City 4| 104|
+----+------+----+

Dataframe with City 2

+----+------+----+
|week|  city|sale|
+----+------+----+
|  29|City 2|  72|
|  28|City 2|  19|
|  27|City 2|  16|
|  26|City 2|  19|
+----+------+----+
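
If you want the keyed structure from the question rather than a sequence of pairs, a minimal sketch (reusing the same `data` DataFrame and the `spark.implicits._` import from above; the `byCity` name is just for illustration) collects the distinct keys and builds a Map:

import org.apache.spark.sql.DataFrame

// Build a Map keyed by the group value instead of a Seq of pairs.
// The keys are collected to the driver, so this assumes the number
// of distinct values in "city" is reasonably small.
val byCity: Map[String, DataFrame] = data
  .select("city").distinct.collect()
  .map(_.getString(0))                        // extract each key as a String
  .map(c => c -> data.where($"city" === c))   // one filtered DataFrame per key
  .toMap

// Look up a single group's DataFrame by its key
byCity("City 1").show()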

You can also group the data with partitionBy and write it out like this:

dataframe.write.partitionBy("col").saveAsTable("outputpath")

This creates a separate output partition (directory) for each distinct value of "col".
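
As a rough sketch of that idea (the `/tmp/sales_by_city` path is just a placeholder, and this writes Parquet files to a path instead of saving a managed table), you could write one partition per city and then read a single group back, letting Spark prune to the matching partition:

// Write one directory per distinct value of "city", e.g. .../city=City 1/
data.write
  .mode("overwrite")
  .partitionBy("city")
  .parquet("/tmp/sales_by_city")

// Reading back recovers "city" from the directory names; filtering on it
// only scans the matching partition (partition pruning).
val city1 = spark.read.parquet("/tmp/sales_by_city").where($"city" === "City 1")
city1.show()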

Hope this helps!