I'm working with a Spark DataFrame (in Scala), and what I want to do is group by a column and turn the different groups into a sequence of DataFrames.
So it would look something like
df.groupBy("col").toSeq -> Seq[DataFrame]
Even better would be something keyed by the group value
df.groupBy("col").toSeq -> Map[key, DataFrame]
This seems like an obvious thing to do, but I can't seem to figure out how to make it work.
Answer 0 (score: 2)
Here's what you can do; this is a simple example:
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
//create a dataframe with dummy data
//get list of cities
val city = data.select("city").distinct.collect().flatMap(_.toSeq)
// filter the rows for each city
// this returns Array[(Any, DataFrame)] as (city, DataFrame) pairs
val result = city.map(c => c -> data.where($"city" === c))
//print all the dataframes
result.foreach { case (c, df) =>
  println(s"Dataframe with $c")
  df.show()
}
The output looks like this:
Dataframe with City 1
+----+------+----+
|week| city|sale|
+----+------+----+
| 28|City 1| 84|
| 27|City 1| 89|
+----+------+----+
Dataframe with City 3
+----+------+----+
|week| city|sale|
+----+------+----+
| 28|City 3| 48|
| 27|City 3| 42|
| 26|City 3| 68|
| 29|City 3| 27|
+----+------+----+
Dataframe with City 4
+----+------+----+
|week| city|sale|
+----+------+----+
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 4| 104|
+----+------+----+
Dataframe with City 2
+----+------+----+
|week| city|sale|
+----+------+----+
| 29|City 2| 72|
| 28|City 2| 19|
| 27|City 2| 16|
| 26|City 2| 19|
+----+------+----+
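If you want the keyed collection from your question (a Map rather than a Seq), note that result is already a collection of (key, DataFrame) pairs, so .toMap converts it directly. A minimal sketch ("City 1" is just one of the dummy values above):

// convert the (city, DataFrame) pairs into a Map keyed by city
val byCity: Map[Any, org.apache.spark.sql.DataFrame] = result.toMap

// look up a single city's DataFrame by key
byCity("City 1").show()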
You can also group the data using partitionBy and write it out as
dataframe.write.partitionBy("col").saveAsTable("outputpath")
This creates a separate partition for each distinct value of "col".
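For example, a minimal sketch of the partitioned write against the dummy data above, assuming a path-based parquet output instead of a managed table ("output/sales" is a hypothetical path, not from the original answer):

// write one subdirectory per distinct "city" value
data.write
  .partitionBy("city")
  .parquet("output/sales")

// on disk this produces directories like:
//   output/sales/city=City 1/
//   output/sales/city=City 2/
//   ...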
Hope this helps!