I have a Spark DataFrame:
Level    Hierarchy    Code
--------------------------
Level1   Hier1        1
Level1   Hier2        2
Level1   Hier3        3
Level1   Hier4        4
Level1   Hier5        5
Level2   Hier1        1
Level2   Hier2        2
Level2   Hier3        3
I need to convert it into a Map variable of type Map[String, Map[Int, String]], i.e.

Map(
  "Level1" -> Map(1 -> "Hier1", 2 -> "Hier2", 3 -> "Hier3", 4 -> "Hier4", 5 -> "Hier5"),
  "Level2" -> Map(1 -> "Hier1", 2 -> "Hier2", 3 -> "Hier3")
)

Please suggest a suitable way to achieve this.
My attempt. It works, but it's ugly:

val level_code_df = master_df.select("Level", "Hierarchy", "Code").distinct()
val hierarchy_names = level_code_df.select("Level").distinct().collect()
var hierarchyMap: scala.collection.mutable.Map[String, scala.collection.mutable.Map[Int, String]] =
  scala.collection.mutable.Map()

for (i <- 0 until hierarchy_names.size) {
  println("names:" + hierarchy_names(i)(0))
  val name = hierarchy_names(i)(0).toString
  val code_level_map = level_code_df.rdd.map { row =>
    if (name.equals(row.getAs[String]("Level")))
      Map(row.getAs[String]("Code").toInt -> row.getAs[String]("Hierarchy"))
    else
      Map[Int, String]()
  }.reduce(_ ++ _)
  hierarchyMap = hierarchyMap + (name -> (collection.mutable.Map() ++ code_level_map))
}
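For reference, the reshaping itself (setting Spark aside) is just a group-by over (level, hierarchy, code) triples. A minimal plain-Scala sketch of the target structure, using hypothetical in-memory data instead of DataFrame rows:

```scala
// Plain-Scala sketch of the desired transformation, no Spark involved.
val rows = Seq(
  ("Level1", "Hier1", 1), ("Level1", "Hier2", 2), ("Level1", "Hier3", 3),
  ("Level1", "Hier4", 4), ("Level1", "Hier5", 5),
  ("Level2", "Hier1", 1), ("Level2", "Hier2", 2), ("Level2", "Hier3", 3)
)

// Group by level, then build the inner code -> hierarchy map per group.
val hierarchyMap: Map[String, Map[Int, String]] =
  rows.groupBy(_._1).map { case (level, group) =>
    level -> group.map { case (_, hier, code) => code -> hier }.toMap
  }

println(hierarchyMap("Level1")(4)) // Hier4
```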
Answer 0 (score: 4)
You need to use dataframe.groupByKey("level") followed by mapGroups. Don't forget to include the kryo map encoder:

case class Data(level: String, hierarchy: String, code: Int)

val data = Seq(
  Data("Level1", "Hier1", 1),
  Data("Level1", "Hier2", 2),
  Data("Level1", "Hier3", 3),
  Data("Level1", "Hier4", 4),
  Data("Level1", "Hier5", 5),
  Data("Level2", "Hier1", 1),
  Data("Level2", "Hier2", 2),
  Data("Level2", "Hier3", 3)).toDS

implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Map[Int, String]]]

Spark 2.0+:

data.groupByKey(_.level).mapGroups {
  case (level, values) => Map(level -> values.map(v => (v.code, v.hierarchy)).toMap)
}.collect()
// Array[Map[String,Map[Int,String]]] = Array(Map(Level1 -> Map(5 -> Hier5, 1 -> Hier1, 2 -> Hier2, 3 -> Hier3, 4 -> Hier4)), Map(Level2 -> Map(1 -> Hier1, 2 -> Hier2, 3 -> Hier3)))

Spark 1.6+: the same approach with data.groupBy(_.level), the Dataset method that was renamed to groupByKey in 2.0.
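If a single Map[String, Map[Int, String]] is wanted rather than an array of single-entry maps, the collected result can be folded together. A small plain-Scala sketch of that final step (the sample values stand in for the real collected output):

```scala
// Merging the Array[Map[String, Map[Int, String]]] returned by collect()
// into the single map the question asks for.
val collected: Array[Map[String, Map[Int, String]]] = Array(
  Map("Level1" -> Map(1 -> "Hier1", 2 -> "Hier2")),
  Map("Level2" -> Map(1 -> "Hier1"))
)

// Each element holds a distinct level key, so ++ merges without collisions.
val merged: Map[String, Map[Int, String]] = collected.reduce(_ ++ _)

println(merged.keySet) // Set(Level1, Level2)
```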
Answer 1 (score: 0)
@prudenko's answer is probably the most concise, and should work on Spark 1.6 or later. But if you're looking for a solution that stays within the DataFrames API (rather than Datasets), you can use a simple UDF:
import scala.collection.mutable
import org.apache.spark.sql.functions._

val mapCombiner = udf[Map[Int, String], mutable.WrappedArray[Map[Int, String]]] { _.reduce(_ ++ _) }

val result: Map[String, Map[Int, String]] = df
  .groupBy("Level")
  .agg(collect_list(map($"Code", $"Hierarchy")) as "Maps")
  .select($"Level", mapCombiner($"Maps") as "Combined")
  .rdd.map(r => (r.getAs[String]("Level"), r.getAs[Map[Int, String]]("Combined")))
  .collectAsMap()
  .toMap
Note: if a single key (a value of Level) can have thousands of distinct values, this will perform poorly (or OOM). But since you're collecting all of it into driver memory anyway, either that won't be a problem, or your requirement wouldn't be feasible in the first place.
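Stripped of Spark, the UDF's job is just to merge the per-row single-entry maps that collect_list gathers for each level. A plain-Scala sketch of that combining step, with hypothetical sample input:

```scala
// What mapCombiner does for one group: collect_list(map(Code, Hierarchy))
// hands the UDF a sequence of single-entry maps, which are reduced into one.
val perRowMaps: Seq[Map[Int, String]] =
  Seq(Map(1 -> "Hier1"), Map(2 -> "Hier2"), Map(3 -> "Hier3"))

val combined: Map[Int, String] = perRowMaps.reduce(_ ++ _)

println(combined) // Map(1 -> Hier1, 2 -> Hier2, 3 -> Hier3)
```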