How do I split a column containing an array of structs into separate columns?

Time: 2019-02-12 10:32:04

Tags: scala apache-spark dataframe

I have the following scenario:

case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])


val entities = List(entity("1",List(attribute("name","sasha"),attribute("home","del"))),
entity("2",List(attribute("home","hyd"))))

val df = entities.toDF()

// df.show
+---+--------------------+
| id|                attr|
+---+--------------------+
|  1|[[name,sasha], [d...|
|  2|        [[home,hyd]]|
+---+--------------------+

//df.printSchema
root
 |-- id: string (nullable = true)
 |-- attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

What I want to produce is:

+---+--------------------+-------+
| id|  name              |  home |
+---+--------------------+-------+
|  1| sasha              |del    |
|  2| null               |hyd    |
+---+--------------------+-------+

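How should I go about this? I have looked at many similar questions on Stack Overflow but could not find anything useful.

My main motivation is to run a groupBy on different attributes, so I want the data in the format above.

I looked into the explode function. It breaks the list into separate rows, which I don't want. I want to create more columns from the array of attribute.

Similar questions I found: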

Spark - convert Map to a single-row DataFrame

Split 1 column into 3 columns in spark scala

Spark dataframe - Split struct column into 2 columns

1 Answer:

Answer 0: (score: 2)

This can easily be reduced to "PySpark converting a column of type 'map' to multiple columns in a dataframe" and "How to get keys and values from MapType column in SparkSQL DataFrame". First, convert attr to map<string, string>:

import org.apache.spark.sql.functions.{explode, map_from_entries, map_keys}

val dfMap = df.withColumn("attr", map_from_entries($"attr"))

Then just find the unique keys:

val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect

Then select each key from the map:

val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
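
For reference, the three steps above can be put together into one self-contained program. This is only a sketch under a few assumptions: Spark 2.4 or later (where map_from_entries is available), a local SparkSession, and keys that are unique within each row's attr array. The names Attribute, Entity, AttrToColumns and widen are made up for the example, and the keys are sorted only to make the column order deterministic.

```scala
import org.apache.spark.sql.{Column, DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, explode, map_from_entries, map_keys}

// Hypothetical case classes mirroring the question's data; they are declared
// at the top level so Spark can derive encoders for them.
case class Attribute(key: String, value: String)
case class Entity(id: String, attr: Seq[Attribute])

object AttrToColumns {

  // Turns the array<struct<key,value>> column "attr" into one column per key.
  // Assumes the input has columns "id" and "attr" and that keys are unique
  // within each row.
  def widen(df: DataFrame): DataFrame = {
    // Step 1: array of (key, value) structs -> map<string, string>
    val dfMap = df.withColumn("attr", map_from_entries(col("attr")))

    // Step 2: collect the distinct keys to the driver (this runs a Spark job);
    // sorting only makes the resulting column order deterministic
    val keys: Array[String] = dfMap
      .select(explode(map_keys(col("attr"))))
      .as[String](Encoders.STRING)
      .distinct()
      .collect()
      .sorted

    // Step 3: one lookup per key; rows without that key get null
    val attrCols: Array[Column] = keys.map(k => col("attr")(k).as(k))
    dfMap.select(col("id") +: attrCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("attr-to-columns")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Entity("1", Seq(Attribute("name", "sasha"), Attribute("home", "del"))),
      Entity("2", Seq(Attribute("home", "hyd")))
    ).toDF()

    widen(df).show()
    spark.stop()
  }
}
```

Because the key list is collected up front, the final select is a plain projection with no shuffle, and the resulting columns can be used directly in a groupBy.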

A less efficient but more concise variant uses explode and pivot:

import org.apache.spark.sql.functions.{explode, first, map_from_entries}

val result = df
  .select($"id", explode(map_from_entries($"attr")))
  .groupBy($"id")
  .pivot($"key")
  .agg(first($"value"))

result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
|  1| del|sasha|
|  2| hyd| null|
+---+----+-----+

But in practice I would not recommend this.