PySpark - Converting a string to a timestamp via when()

Date: 2020-01-29 17:31:28

Tags: pyspark

I am trying to convert a string to a timestamp:

from pyspark.sql import functions as psf

target_df = df \
    .withColumn(
        'my_ts',
        psf.when(
            psf.to_timestamp(psf.col("my_ts"), "dd/MM/yyyy HH:mm:ss").isNotNull(), 
            psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss")
        ) \
        .psf.when(
            psf.to_timestamp(psf.col("my_ts"), "dd-MMM-yy").isNotNull(), 
            psf.to_timestamp("my_ts", "dd-MMM-yy")
        ) \
        .psf.when(
            psf.to_timestamp(psf.col("my_ts"), "yyyyMMdd").isNotNull(), 
            psf.to_timestamp("my_ts", "yyyyMMdd")
        ) \
        .otherwise(None)
    )

However, I get the following error:

IllegalArgumentException: 'when() can only be applied on a Column previously generated by when() function'

I tried wrapping psf.col() around psf.to_timestamp(), but that also raised an error. Any ideas how to fix this?

1 Answer:

Answer 0 (score: 1)

You're almost there. The problem is that when().psf.when() doesn't work: psf is just the module alias, not an attribute of the Column that when() returns. Chain .when() directly on the result instead:

from pyspark.sql import SparkSession
from pyspark.sql import functions as psf
from pyspark.sql.functions import when

spark = SparkSession.builder.getOrCreate()

# Sample data covering the three expected formats, plus one value that
# matches none of them.
df = spark.createDataFrame(
    [
        ["2019-01-12"],
        ["20190112"],
        ["12/01/2019 11:22:11"],
        ["12-Jan-19"]
    ], ["my_ts"])

target_df = df \
    .withColumn(
        'my_new_ts',
        # Try each format in turn; the first pattern that parses to a
        # non-null timestamp wins.
        when(
            psf.to_timestamp(psf.col("my_ts"), "dd/MM/yyyy HH:mm:ss").isNotNull(),
            psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss")
        )
        .when(
            psf.to_timestamp(psf.col("my_ts"), "dd-MMM-yy").isNotNull(),
            psf.to_timestamp("my_ts", "dd-MMM-yy")
        )
        .when(
            psf.to_timestamp(psf.col("my_ts"), "yyyyMMdd").isNotNull(),
            psf.to_timestamp("my_ts", "yyyyMMdd")
        )
        .otherwise(None)
    )

df.show()
target_df.show()

Output:

+-------------------+
|              my_ts|
+-------------------+
|         2019-01-12|
|           20190112|
|12/01/2019 11:22:11|
|          12-Jan-19|
+-------------------+

+-------------------+-------------------+
|              my_ts|          my_new_ts|
+-------------------+-------------------+
|         2019-01-12|               null|
|           20190112|2019-01-12 00:00:00|
|12/01/2019 11:22:11|2019-01-12 11:22:11|
|          12-Jan-19|2019-01-12 00:00:00|
+-------------------+-------------------+
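
As a side note, if you'd rather keep everything under the psf alias instead of importing when separately, the same chain works, because psf.when() itself returns a Column that exposes .when() and .otherwise(). A minimal sketch of the equivalent expression:

# Equivalent chain without the extra import: psf.when() returns a Column,
# and .when()/.otherwise() are methods on that Column.
target_df = df.withColumn(
    'my_new_ts',
    psf.when(
        psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss").isNotNull(),
        psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss")
    ).when(
        psf.to_timestamp("my_ts", "dd-MMM-yy").isNotNull(),
        psf.to_timestamp("my_ts", "dd-MMM-yy")
    ).when(
        psf.to_timestamp("my_ts", "yyyyMMdd").isNotNull(),
        psf.to_timestamp("my_ts", "yyyyMMdd")
    ).otherwise(None)
)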

Also, if you want a more concise version, you can use psf.coalesce, which returns the first non-null column, so the formats are tried in order:

from pyspark.sql import functions as psf

# coalesce keeps the first non-null result, so the order of the
# to_timestamp calls encodes the format priority.
target_df = df.select("*",
                psf.coalesce(
                    psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss"),
                    psf.to_timestamp("my_ts", "dd-MMM-yy"),
                    psf.to_timestamp("my_ts", "yyyyMMdd")
                ).alias("my_new_ts"))
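
A quick sanity check: since each to_timestamp call yields null on a failed parse, this should reproduce the same my_new_ts column as the when() chain above.

target_df.show()
# Rows that none of the formats can parse (e.g. "2019-01-12")
# come back as null, just as in the output shown earlier.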