Question

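I am trying to dynamically change the data types of columns in a DataFrame based on their current data type rather than on the column name. At the moment I use the following code to cast every column to StringType, which avoids data type conflicts when loading the data into Kudu dynamically: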

val newdf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

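What I would like to do instead is change only certain column data types to a chosen replacement type (for example, change every column defined as DateType to Timestamp).

I have searched for quite a while but have not found anything that covers this.

Any help would be greatly appreciated.

Thanks,

Greg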

Answer 1

Here is one way to do it:

df.printSchema
df.show()

root
 |-- Int_Col1: integer (nullable = false)
 |-- Dt_Col1: date (nullable = true)
 |-- Str_Col1: string (nullable = true)
 |-- Dt_Col2: date (nullable = true)

+--------+----------+--------+----------+
|Int_Col1|   Dt_Col1|Str_Col1|   Dt_Col2|
+--------+----------+--------+----------+
|       1|1990-09-30|     AAA|1990-09-30|
|       2|2001-12-14|      BB|1990-09-30|
+--------+----------+--------+----------+

Then select only the DateType columns that need converting and use foldLeft to cast each of them to TimestampType.

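import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Pair each DateType column with its target type, then cast them one at a time.
// Add more cases for other mappings, e.g. "IntegerType" => (dn, DoubleType).
val result = df.dtypes
  .collect { case (dn, dt) if dt.startsWith("DateType") => (dn, TimestampType) }
  .foldLeft(df)((accDF, c) => accDF.withColumn(c._1, col(c._1).cast(c._2)))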

result.printSchema
result.show(false)

Output:

root
 |-- Int_Col1: integer (nullable = false)
 |-- Dt_Col1: timestamp (nullable = true)
 |-- Str_Col1: string (nullable = true)
 |-- Dt_Col2: timestamp (nullable = true)

+--------+-------------------+--------+-------------------+
|Int_Col1|Dt_Col1            |Str_Col1|Dt_Col2            |
+--------+-------------------+--------+-------------------+
|1       |1990-09-30 00:00:00|AAA     |1990-09-30 00:00:00|
|2       |2001-12-14 00:00:00|BB      |1990-09-30 00:00:00|
+--------+-------------------+--------+-------------------+
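
As a variation on the same idea, the matching can also be done against the DataFrame's schema (DataType objects) instead of the type-name strings returned by dtypes. The sketch below is only an illustration: castByType and the CastByTypeExample object are hypothetical names, not part of Spark or of the answer above, and it assumes column names contain no dots or other characters that col() would parse specially.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object CastByTypeExample {

  // Hypothetical helper: cast every column whose current type is a key of
  // `mapping` to the corresponding value; all other columns pass through as-is.
  def castByType(df: DataFrame, mapping: Map[DataType, DataType]): DataFrame = {
    val projected = df.schema.fields.map { f =>
      mapping.get(f.dataType) match {
        case Some(target) => col(f.name).cast(target).as(f.name) // keep the original name
        case None         => col(f.name)
      }
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the example only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("cast-by-type").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, java.sql.Date.valueOf("1990-09-30"), "AAA"),
      (2, java.sql.Date.valueOf("2001-12-14"), "BB")
    ).toDF("Int_Col1", "Dt_Col1", "Str_Col1")

    // Only DateType columns are remapped here; add more pairs as needed.
    val result = castByType(df, Map[DataType, DataType](DateType -> TimestampType))
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

This version builds a single select instead of chaining one withColumn call per column, which may be preferable for very wide DataFrames. Matching on DataType values also means parameterised types such as DecimalType(10, 2) have to be listed with their exact parameters.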

How to dynamically change column data types in a Spark DataFrame based on their current data type