I have the following setup:
// Assumes a spark-shell / a SparkSession named spark is in scope.
import spark.implicits._

case class attribute(key: String, value: String)
case class entity(id: String, attr: List[attribute])

val entities = List(
  entity("1", List(attribute("name", "sasha"), attribute("home", "del"))),
  entity("2", List(attribute("home", "hyd"))))
val df = entities.toDF()
// df.show
+---+--------------------+
| id| attr|
+---+--------------------+
|  1|[[name,sasha], [h...|
|  2|        [[home,hyd]]|
+---+--------------------+
//df.printSchema
root
|-- id: string (nullable = true)
|-- attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
What I want to produce is:
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
How should I go about this? I've looked at many similar questions on Stack Overflow but couldn't find anything that helped.
My main motivation is to groupBy on the different attributes, which is why I want the data in the format above.
I looked into the explode function, but it breaks the list into separate rows, which is not what I want. I want to create more columns from the attribute array.
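To illustrate, this is roughly the shape explode gives me (a sketch, assuming the df above):

import org.apache.spark.sql.functions.explode

// Sketch: explode yields one row per attribute struct, i.e. the
// row-wise shape described above rather than the extra columns I'm after.
df.select($"id", explode($"attr").as("a"))
  .select($"id", $"a.key", $"a.value")
  .show
// +---+----+-----+
// | id| key|value|
// +---+----+-----+
// |  1|name|sasha|
// |  1|home|  del|
// |  2|home|  hyd|
// +---+----+-----+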
A similar question I found: Spark - convert Map to a single-row DataFrame
Answer (score: 2):
This can easily be reduced to PySpark converting a column of type 'map' to multiple columns in a dataframe or How to get keys and values from MapType column in SparkSQL DataFrame. First convert attr to map<string, string>:
import org.apache.spark.sql.functions.{explode, first, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
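Note that map_from_entries is only available from Spark 2.4. On older versions, a small UDF can build the map instead (a sketch, not part of the original answer):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Sketch for Spark < 2.4: convert the array<struct<key,value>> column
// into map<string,string> with a UDF.
val toMap = udf((xs: Seq[Row]) => xs.map(r => r.getString(0) -> r.getString(1)).toMap)
val dfMapCompat = df.withColumn("attr", toMap($"attr"))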
Then just find the unique keys:
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
Then select each key from the map column:
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
| 1|sasha| del|
| 2| null| hyd|
+---+-----+----+
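With the attributes promoted to columns, the groupBy mentioned in the question becomes straightforward (a hypothetical usage example):

// Hypothetical usage: aggregate by one of the extracted attribute columns.
result.groupBy($"home").count.show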
A less efficient, but more concise, variant with explode and pivot:
val result = df
.select($"id", explode(map_from_entries($"attr")))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
| 1| del|sasha|
| 2| hyd| null|
+---+----+-----+
In practice, though, I wouldn't recommend this one.
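If you do take the pivot route, passing the pivot values explicitly spares Spark the extra job it otherwise runs to discover them (a sketch reusing the keys collected above):

// Sketch: explicit pivot values avoid a separate pass to find distinct keys.
val resultExplicit = df
  .select($"id", explode(map_from_entries($"attr")))
  .groupBy($"id")
  .pivot($"key", keys.toSeq)
  .agg(first($"value"))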