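What is the best/fastest way to rename DataFrame columns?
I've noticed that the .withColumnRenamed function performs very poorly when renaming multiple columns, regardless of the size of the DataFrame, and it doesn't use any executors either.
What am I missing?
Here is my situation: I have a JSON object with thousands of values in a deeply nested structure. The variable names within the JSON structure repeat over and over: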
|-- taskData_data_variables: struct (nullable = true)
| |-- ProcessId: string (nullable = true)
| |-- CommentsReceived: struct (nullable = true)
| | |-- @metadata: struct (nullable = true)
| | | |-- dirty: boolean (nullable = true)
| | | |-- invalid: boolean (nullable = true)
| | | |-- objectID: string (nullable = true)
| | | |-- shared: boolean (nullable = true)
| | |-- items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- @metadata: struct (nullable = true)
| | | | | |-- className: string (nullable = true)
| | | | | |-- dirty: boolean (nullable = true)
| | | | | |-- invalid: boolean (nullable = true)
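I now need to flatten the data out into columns and one-to-many tables. As part of that process, I need to rename the columns so that they retain their source metadata.
At the end of the first iteration of the process, I end up with a table like this: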
|-- taskData_data_variables: struct (nullable = true)
|-- taskData_data_variables_ProcessId: string (nullable = true)
|-- taskData_data_variables_CommentsReceived: struct (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata: struct (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_dirty: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_invalid: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_objectID: string (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_shared: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- @metadata: struct (nullable = true)
| | | | | |-- className: string (nullable = true)
| | | | | |-- dirty: boolean (nullable = true)
| | | | | |-- invalid: boolean (nullable = true)
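My code works, but it takes a very long time, because there are hundreds of these values at various depths of nesting.
I also have some code that moves the arrays out into another table.
I just need some guidance/best-practice advice on the most efficient way to rename the columns.
Here is the code: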
    var dataframeRenamed = spark.sql("SELECT * FROM some_table")
    val dfNewSchema = dataframeRenamed.schema
    // Rename one column per iteration; every call returns a new DataFrame
    for (renameField <- dfNewSchema) {
      dataframeRenamed = dataframeRenamed.withColumnRenamed(renameField.name, renameField.name + "_somethingElse")
    }
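
For comparison, here is a sketch of the same rename done as a single projection instead of one withColumnRenamed call per column. As far as I understand, each withColumnRenamed call only builds a new plan on the driver, which would be consistent with the executors sitting idle, but I have not benchmarked this variant. The object and helper names (RenameAllColumns, renameAll, renameAllToDF) and the tiny local DataFrame are made up for illustration; select, toDF, col and as are the standard Spark calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RenameAllColumns {
  // Hypothetical helper: alias every top-level column in one select.
  // Backticks keep names containing dots or characters like '@' from being parsed as paths.
  def renameAll(df: DataFrame, suffix: String): DataFrame =
    df.select(df.columns.map(name => col(s"`$name`").as(name + suffix)): _*)

  // Hypothetical helper: same result via toDF, which takes the full list of new names.
  def renameAllToDF(df: DataFrame, suffix: String): DataFrame =
    df.toDF(df.columns.map(_ + suffix): _*)

  def main(args: Array[String]): Unit = {
    // Local session and sample data are assumptions, used only to make the sketch runnable.
    val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a")).toDF("id", "value")
    renameAll(df, "_somethingElse").printSchema()
    renameAllToDF(df, "_somethingElse").printSchema()

    spark.stop()
  }
}
```

Both helpers touch only top-level columns, like the loop above; fields nested inside structs or arrays keep their names.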