我知道这个问题在Stack Overflow上已经问过很多遍了,并且在大多数帖子中都得到了令人满意的回答,但是我不确定这是否是我的最佳方法。 我有一个嵌入了几种结构类型的数据集:
root
|-- STRUCT1: struct (nullable = true)
| |-- FIELD_1: string (nullable = true)
| |-- FIELD_2: long (nullable = true)
| |-- FIELD_3: integer (nullable = true)
|-- STRUCT2: struct (nullable = true)
| |-- FIELD_4: string (nullable = true)
| |-- FIELD_5: long (nullable = true)
| |-- FIELD_6: integer (nullable = true)
|-- STRUCT3: struct (nullable = true)
| |-- FIELD_7: string (nullable = true)
| |-- FIELD_8: long (nullable = true)
| |-- FIELD_9: integer (nullable = true)
|-- ARRAYSTRUCT4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD_10: integer (nullable = true)
| | |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1| STRUCT2 | STRUCT3 | ARRAYSTRUCT4 |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
我想将其转换为:
1.将结构扩展为列的数据集。
2.数组(ARRAYSTRUCT4)分解为行的数据集。
root
|-- FIELD_1: string (nullable = true)
|-- FIELD_2: long (nullable = true)
|-- FIELD_3: integer (nullable = true)
|-- FIELD_4: string (nullable = true)
|-- FIELD_5: long (nullable = true)
|-- FIELD_6: integer (nullable = true)
|-- FIELD_7: string (nullable = true)
|-- FIELD_8: long (nullable = true)
|-- FIELD_9: integer (nullable = true)
|-- FIELD_10: integer (nullable = true)
|-- FIELD_11: integer (nullable = true)
+-------+------------+------------+---------+ ---------+----------+
|FIELD_1| FIELD_2 | FIELD_3 | FIELD_4 | |FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... ---------+----------+
|1 |2 |3 | aa | | 1a | 2b |
+-------+------------+------------+-----------------------------------+
要实现这一目标,我可以使用:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "STRUCT4")
后跟爆炸:
val exploded = expanded.select(explode(expanded("STRUCT4")))
但是,我想知道是否还有一种更实用的方法来做到这一点,尤其是选择。我可以按以下方式使用withColumn
:
data.withColumn("FIELD_1", $"STRUCT1".getItem(0))
.withColumn("FIELD_2", $"STRUCT1".getItem(1))
.....
但是我有80多个专栏。有没有更好的方法可以做到这一点?
答案 0 :(得分:2)
您可以先通过list("67")
使所有列['6', '7']
成为类型,然后通过struct
将任何explode
列放入Array(struct)
列中,然后使用{{1 }},将每个struct
列名插值到foldLeft
中,如下所示:
map
请注意,为简单起见,假设您的DataFrame仅具有struct
和col.*
类型的列。如果还有其他数据类型,只需将过滤条件应用于import org.apache.spark.sql.functions._
case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)
val df = Seq(
(S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
(S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// | STRUCT1| STRUCT2| STRUCT3| ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes.filter( t => t._2.startsWith("ArrayType(StructType") ).
map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)
val expandedDF = arrayCols.foldLeft(df)((accDF, c) =>
accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)
val structCols = expandedDF.columns
expandedDF.select(structCols.map(c => col(s"$c.*")): _*).
show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 1| 1|
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 3| 3|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 2| 2|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 4| 4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
和struct
。