Converting a Spark DataFrame to a Scala Map collection

Time: 2016-04-27 16:15:17

Tags: apache-spark dataframe apache-spark-sql

I am trying to find the best solution for converting an entire Spark DataFrame to a Scala Map collection. This is best illustrated as follows:

Going from this (from the Spark examples):

val df = sqlContext.read.json("examples/src/main/resources/people.json")

df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

To a Scala collection (a Map of Maps), like this:

val people = Map(
  Map("age" -> null, "name" -> "Michael"),
  Map("age" -> 30, "name" -> "Andy"),
  Map("age" -> 19, "name" -> "Justin")
)

2 Answers:

Answer 0 (score: 11)

I don't think your question quite makes sense -- regarding your outermost Map, I only see you trying to stuff values into it -- you need key/value pairs in your outermost Map. That being said:

val peopleArray = df.collect.map(r => Map(df.columns.zip(r.toSeq):_*))

would give you:

Array(
  Map("age" -> null, "name" -> "Michael"),
  Map("age" -> 30, "name" -> "Andy"),
  Map("age" -> 19, "name" -> "Justin")
)

At that point you could do:

val people = Map(peopleArray.map(p => (p.getOrElse("name", null), p)):_*)

which would give you:

Map(
  ("Michael" -> Map("age" -> null, "name" -> "Michael")),
  ("Andy" -> Map("age" -> 30, "name" -> "Andy")),
  ("Justin" -> Map("age" -> 19, "name" -> "Justin"))
)

I'm guessing this is really more like what you want. If you wanted to key them on an arbitrary Long index instead, you can do:

val indexedPeople = Map(peopleArray.zipWithIndex.map(r => (r._2, r._1)):_*)

which gives you:
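(The indexed result is missing from this copy of the answer; with the three sample rows above it would presumably look like:)

Map(
  (0 -> Map("age" -> null, "name" -> "Michael")),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)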

Answer 1 (score: 1)

First get the schema from the Dataframe:

val schemaList = dataframe.schema.map(_.name).zipWithIndex // get schema list (column name, index) from dataframe

Then get the RDD from the dataframe and map over it:
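(The code for this second step is missing from this copy of the answer; a minimal sketch of what it could look like, reusing the schemaList pairs defined above, where each rec is a (columnName, index) tuple:)

dataframe.rdd.map(row =>
  // rec._1 is the column name, rec._2 is that column's index in the row
  schemaList.map(rec => (rec._1, row(rec._2))).toMap
).collect.foreach(println)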