Question

I have a DataFrame with the following schema:

root
 |-- col_a: string (nullable = false)
 |-- col_b: string (nullable = false)
 |-- col_c_a: string (nullable = false)
 |-- col_c_b: string (nullable = false)
 |-- col_d: string (nullable = false)
 |-- col_e: string (nullable = false)
 |-- col_f: string (nullable = false)

Now I want to convert this DataFrame's schema to something like this:

root
 |-- col_a: string (nullable = false)
 |-- col_b: string (nullable = false)
 |-- col_c: struct (nullable = false)
     |-- col_c_a: string (nullable = false)
     |-- col_c_b: string (nullable = false)
 |-- col_d: string (nullable = false)
 |-- col_e: string (nullable = false)
 |-- col_f: string (nullable = false)

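I can do this with a map transformation by explicitly reading each column's value from the Row, but that is a convoluted process and does not look good, so:

Is there any other way to achieve this?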

Thanks

Answer 1

There is a built-in struct function (in org.apache.spark.sql.functions), defined as:

def struct(cols: Column*): Column

You can use it like this:

df.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  2|  3|
+---+---+

df.withColumn("struct_col", struct($"a", $"b")).show
+---+---+----------+
|  a|  b|struct_col|
+---+---+----------+
|  1|  2|     [1,2]|
|  2|  3|     [2,3]|
+---+---+----------+

The schema of the new DataFrame is:

 |-- a: integer (nullable = false)
 |-- b: integer (nullable = false)
 |-- struct_col: struct (nullable = false)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)
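As a small sketch of how the nested fields behave afterwards: the struct's field names come from the names of the columns passed to struct, so aliasing a column renames the field, and the fields can be read back with dot notation or expanded with ".*". The object name StructFields, the helper withRenamedStruct, and the field names x and y are made up for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object StructFields {
  // Hypothetical helper: packs columns a and b into struct_col,
  // renaming the struct fields to x and y via aliases.
  def withRenamedStruct(df: DataFrame): DataFrame =
    df.withColumn("struct_col", struct(col("a").as("x"), col("b").as("y")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for the demo.
    val spark = SparkSession.builder().master("local[*]").appName("struct-fields").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2), (2, 3)).toDF("a", "b")
    val nested = withRenamedStruct(df)

    nested.printSchema()
    nested.select(col("struct_col.x")).show() // read a single nested field
    nested.select("struct_col.*").show()      // expand the struct back into top-level columns

    spark.stop()
  }
}
```

Expanding with "struct_col.*" is also a convenient way to undo the nesting if you need the flat columns again.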

In your case, you can do the following:

df.withColumn("col_c", struct($"col_c_a", $"col_c_b")).drop("col_c_a", "col_c_b")
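One thing to be aware of: withColumn appends the new column at the end, so col_c ends up after col_f rather than between col_b and col_d as in the target schema. If the column order matters, a select that builds the struct in place is one option. The following is a minimal sketch under that assumption; NestColumns and nest are hypothetical names, the sample row is invented, and the helper places the struct at the position of the first listed member column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object NestColumns {
  // Hypothetical helper: replaces the columns listed in `members` with one
  // struct column called `name`, positioned where members.head used to be.
  def nest(df: DataFrame, name: String, members: Seq[String]): DataFrame = {
    val cols = df.columns.toSeq.flatMap {
      case c if c == members.head   => Some(struct(members.map(col): _*).as(name))
      case c if members.contains(c) => None          // folded into the struct
      case c                        => Some(col(c))  // kept unchanged
    }
    df.select(cols: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for the demo.
    val spark = SparkSession.builder().master("local[*]").appName("nest-columns").getOrCreate()
    import spark.implicits._

    // Invented sample row with the column names from the question.
    val df = Seq(("a", "b", "ca", "cb", "d", "e", "f"))
      .toDF("col_a", "col_b", "col_c_a", "col_c_b", "col_d", "col_e", "col_f")

    nest(df, "col_c", Seq("col_c_a", "col_c_b")).printSchema()

    spark.stop()
  }
}
```

For a one-off case the same result can be written directly as df.select($"col_a", $"col_b", struct($"col_c_a", $"col_c_b").as("col_c"), $"col_d", $"col_e", $"col_f").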

Updating a DataFrame's schema in Apache Spark
