Spark: How to split struct types into multiple columns?

Date: 2018-09-02 23:36:13

Tags: scala apache-spark apache-spark-sql

I know this question has been asked many times on Stack Overflow and has been answered satisfactorily in most posts, but I'm not sure whether those approaches are the best fit for my case. I have a Dataset with several embedded struct types:

root
 |-- STRUCT1: struct (nullable = true)
 |    |-- FIELD_1: string (nullable = true)
 |    |-- FIELD_2: long (nullable = true)
 |    |-- FIELD_3: integer (nullable = true)
 |-- STRUCT2: struct (nullable = true)
 |    |-- FIELD_4: string (nullable = true)
 |    |-- FIELD_5: long (nullable = true)
 |    |-- FIELD_6: integer (nullable = true)
 |-- STRUCT3: struct (nullable = true)
 |    |-- FIELD_7: string (nullable = true)
 |    |-- FIELD_8: long (nullable = true)
 |    |-- FIELD_9: integer (nullable = true)
 |-- ARRAYSTRUCT4: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FIELD_10: integer (nullable = true)
 |    |    |-- FIELD_11: integer (nullable = true)

+-------+------------+------------+------------------+
|STRUCT1| STRUCT2    | STRUCT3    | ARRAYSTRUCT4     |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+

I want to convert this into:

1. A dataset with the structs expanded into columns.
2. A dataset with the array (ARRAYSTRUCT4) exploded into rows.

root
 |-- FIELD_1: string (nullable = true)
 |-- FIELD_2: long (nullable = true)
 |-- FIELD_3: integer (nullable = true)
 |-- FIELD_4: string (nullable = true)
 |-- FIELD_5: long (nullable = true)
 |-- FIELD_6: integer (nullable = true)
 |-- FIELD_7: string (nullable = true)
 |-- FIELD_8: long (nullable = true)
 |-- FIELD_9: integer (nullable = true)
 |-- FIELD_10: integer (nullable = true)
 |-- FIELD_11: integer (nullable = true)

+-------+------------+------------+---------+     +---------+----------+
|FIELD_1| FIELD_2    | FIELD_3    | FIELD_4 |     | FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... +---------+----------+
|1      |2           |3           |  aa     |     |  1a     |  2b      |
+-------+------------+------------+---------+     +---------+----------+

To achieve this, I could use:

val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "ARRAYSTRUCT4")

followed by an explode:

val exploded = expanded.select(explode(expanded("ARRAYSTRUCT4")))
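
Putting those two steps together, the result I am after would come from something like the sketch below (the intermediate alias S4 is just illustrative):

import org.apache.spark.sql.functions.{col, explode}

// Expand the three structs and explode the array in one select,
// then flatten the exploded element struct into its own columns.
val flattened = df
  .select(
    col("STRUCT1.*"),
    col("STRUCT2.*"),
    col("STRUCT3.*"),
    explode(col("ARRAYSTRUCT4")).as("S4")
  )
  .select(col("*"), col("S4.*"))
  .drop("S4")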

However, I was wondering whether there is a more functional way to do this, especially the select. I could use withColumn as follows:

data.withColumn("FIELD_1", $"STRUCT1".getField("FIELD_1"))
      .withColumn("FIELD_2", $"STRUCT1".getField("FIELD_2"))
      .....

But I have 80+ columns. Is there a better way to do this?

1 answer:

Answer 0: (score: 2)

You could first turn any Array(struct)-type columns into struct-type columns by interpolating each of them into an explode via foldLeft, then expand all the struct columns by interpolating each struct column name into col.* within a select, as shown below:

import org.apache.spark.sql.functions._

case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)

val df = Seq(
  (S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
  (S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// |    STRUCT1|    STRUCT2|    STRUCT3|  ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+

// Identify the Array(struct)-type columns
val arrayCols = df.dtypes.filter( t => t._2.startsWith("ArrayType(StructType") ).
  map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)

// Explode each array column into rows, replacing it with a struct column
val expandedDF = arrayCols.foldLeft(df)((accDF, c) =>
  accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)

// Expand every struct column into its individual fields
val structCols = expandedDF.columns

expandedDF.select(structCols.map(c => col(s"$c.*")): _*).
  show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |     a1|    101|     11|     a2|    102|     12|     a3|    103|     13|       1|       1|
// |     a1|    101|     11|     a2|    102|     12|     a3|    103|     13|       3|       3|
// |     b1|    201|     21|     b2|    202|     22|     b3|    203|     23|       2|       2|
// |     b1|    201|     21|     b2|    202|     22|     b3|    203|     23|       4|       4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+

Note that, for simplicity, this assumes your DataFrame has only columns of struct and Array(struct) types. If there are other data types, simply apply filtering conditions to arrayCols and structCols accordingly.
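
For instance, a minimal sketch of such filtering (assuming the same expandedDF as above, with any hypothetical non-struct columns passed through unchanged) could look like this:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Illustrative only: split the columns into struct-typed and plain ones,
// expand only the former with col.*, and keep the latter as-is.
val (structTypedCols, plainCols) =
  expandedDF.schema.fields.partition(_.dataType.isInstanceOf[StructType])

expandedDF.select(
  plainCols.map(f => col(f.name)) ++
  structTypedCols.map(f => col(s"${f.name}.*")): _*
).show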