I need to take a Map[String, DataFrame] and turn it into a Dataset[Map[String, Array]].
// assumes spark-shell, where sc and spark.implicits._ are already in scope
val map_of_df = Map(
  "df1" -> sc.parallelize(1 to 4).map(i => (i, i * 1000)).toDF("id", "x").repartition(4),
  "df2" -> sc.parallelize(1 to 4).map(i => (i, i * 100)).toDF("id", "y").repartition(4)
)
//map_of_df: scala.collection.immutable.Map[String,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Map(df1 -> [id: int, x: int], df2 -> [id: int, y: int])
//magic here, I need a type of org.apache.spark.sql.Dataset[Map[String, Array[org.apache.spark.sql.Row]]] with four partitions
//where the keys to the map are "df1" and "df2"
Answer 0 (score: 0)
You can simply collect all of the DataFrames:
import org.apache.spark.sql.{Encoder, Encoders, Row}
// Row has no built-in encoder, so fall back to Kryo for the tuple type
implicit val enc: Encoder[(String, Array[Row])] = Encoders.kryo
map_of_df
  .mapValues(_.collect())
  .toSeq
  .toDS()
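As a quick sanity check (hypothetical variable name, assuming the snippet above ran in spark-shell):

val ds = map_of_df.mapValues(_.collect()).toSeq.toDS()
println(ds.count())  // 2: one record per map key ("df1" and "df2")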
Keep in mind that this does not scale: all of the data is loaded into driver memory. In other words, you don't need Spark for that.
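If you want a per-key grouping without collecting to the driver, one possible sketch (not from the original answer) is to tag each DataFrame with its map key and union them. Note that unionByName with allowMissingColumns requires Spark 3.1+, which matters here because df1 and df2 have different value columns:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_list, lit, struct}

// Tag every row with the key of the DataFrame it came from, then union.
// allowMissingColumns fills the absent x or y column with null.
val tagged: DataFrame = map_of_df
  .map { case (name, df) => df.withColumn("source", lit(name)) }
  .reduce(_.unionByName(_, allowMissingColumns = true))

// Optionally collapse to one array of structs per key, still distributed:
val grouped = tagged
  .groupBy("source")
  .agg(collect_list(struct("id", "x", "y")).as("rows"))

This yields one record per key with an array of structs rather than Array[Row], but the data stays partitioned across the cluster instead of sitting in driver memory.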