I am trying to convert a string to a timestamp:
from pyspark.sql import functions as psf
target_df = df \
    .withColumn(
        'my_ts',
        psf.when(
            psf.to_timestamp(psf.col("my_ts"), "dd/MM/yyyy HH:mm:ss").isNotNull(),
            psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss")
        ) \
        .psf.when(
            psf.to_timestamp(psf.col("my_ts"), "dd-MMM-yy").isNotNull(),
            psf.to_timestamp("my_ts", "dd-MMM-yy")
        ) \
        .psf.when(
            psf.to_timestamp(psf.col("my_ts"), "yyyyMMdd").isNotNull(),
            psf.to_timestamp("my_ts", "yyyyMMdd")
        ) \
        .otherwise(None)
    )
However, I get the following error:
IllegalArgumentException: 'when() can only be applied on a Column previously generated by when() function'
I tried wrapping psf.col() around psf.to_timestamp(), but that also produced an error. Any ideas on how to solve this?
Answer 0 (score: 1)
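You were almost there. The only problem is that when().psf.when() does not work: when() returns a Column, and as far as I can tell `.psf` on a Column is treated as a field access that yields a plain Column, so the following .when() is no longer applied to a Column generated by when(). Chain .when() directly on the result of the first when() and it works.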
from pyspark.sql import functions as psf
from pyspark.sql.functions import when
df = sqlContext.createDataFrame(
[
["2019-01-12"],
["20190112"],
["12/01/2019 11:22:11"],
["12-Jan-19"]
], ["my_ts"])
target_df = df \
    .withColumn(
        'my_new_ts',
        when(
            psf.to_timestamp(psf.col("my_ts"), "dd/MM/yyyy HH:mm:ss").isNotNull(),
            psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss")
        ) \
        .when(
            psf.to_timestamp(psf.col("my_ts"), "dd-MMM-yy").isNotNull(),
            psf.to_timestamp("my_ts", "dd-MMM-yy")
        ) \
        .when(
            psf.to_timestamp(psf.col("my_ts"), "yyyyMMdd").isNotNull(),
            psf.to_timestamp("my_ts", "yyyyMMdd")
        ) \
        .otherwise(None)
    )
df.show()
target_df.show()
Output:
+-------------------+
| my_ts|
+-------------------+
| 2019-01-12|
| 20190112|
|12/01/2019 11:22:11|
| 12-Jan-19|
+-------------------+
+-------------------+-------------------+
| my_ts| my_new_ts|
+-------------------+-------------------+
| 2019-01-12| null|
| 20190112|2019-01-12 00:00:00|
|12/01/2019 11:22:11|2019-01-12 11:22:11|
| 12-Jan-19|2019-01-12 00:00:00|
+-------------------+-------------------+
Also, if you want a more concise version, you can use psf.coalesce, which returns the first non-null value among its arguments:
from pyspark.sql import functions as psf
target_df = df.select("*",
psf.coalesce(
psf.to_timestamp("my_ts", "dd/MM/yyyy HH:mm:ss"),
psf.to_timestamp("my_ts", "dd-MMM-yy"),
psf.to_timestamp("my_ts", "yyyyMMdd")
).alias("my_new_ts"))
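
To make the "first format that parses wins" behaviour of the coalesce version easier to see, here is a minimal plain-Python sketch of the same idea. It is only an analogy and does not use Spark: the helper name `parse_first` and the `FORMATS` tuple are made up for illustration, the patterns are `strptime` equivalents of the three Spark patterns above, and Python's parser may be stricter or looser than Spark's depending on the Spark version.

```python
from datetime import datetime
from typing import Optional, Sequence

# Hypothetical strptime equivalents of the three Spark patterns used above:
# "dd/MM/yyyy HH:mm:ss", "dd-MMM-yy", "yyyyMMdd".
# Note: %b (month abbreviation) depends on the current locale.
FORMATS = ("%d/%m/%Y %H:%M:%S", "%d-%b-%y", "%Y%m%d")


def parse_first(value: str, formats: Sequence[str] = FORMATS) -> Optional[datetime]:
    """Return the result of the first format that parses `value`, else None.

    This mirrors coalesce(to_timestamp(...), to_timestamp(...), ...):
    each failed parse acts like a null, and the first success is kept.
    """
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format did not match; fall through to the next one.
            continue
    return None


if __name__ == "__main__":
    for raw in ["2019-01-12", "20190112", "12/01/2019 11:22:11", "12-Jan-19"]:
        print(raw, "->", parse_first(raw))
```

As in the Spark output above, a value such as "2019-01-12" matches none of the listed formats and comes back as None; adding a fourth format for it would be the fix in either version.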