Question

I am creating a Spark (2.2.0) DataFrame by loading a local file. The file loads as expected and I get the following DF.

scala> df.show(4, false)
+--------+------------------+------------+----------------------------+
|userId  |legacyProductId   |optInFlag   |transaction_date            |
+--------+------------------+------------+----------------------------+
|71844441|805934            |null        |Sat Oct 15 23:35:22 UTC 2005|
|71844441|714837            |null        |Sat Apr 09 10:04:30 UTC 2005|
|71844441|732860            |null        |Sat Mar 19 17:30:26 UTC 2005|
|71844441|1170951           |null        |Sat Mar 19 17:30:26 UTC 2005|
+--------+------------------+------------+----------------------------+
only showing top 4 rows

The first two columns are integers and the last two are strings. I want to convert the transaction_date column to a Unix timestamp. I did the following.

val newdf = df.select($"userId", $"legacyProductId", $"OptInFlag", unix_timestamp($"transaction_date", "EEE MMM dd hh:mm:ss z yyyy"))

This gives me the last column as a Unix timestamp (in seconds since the epoch). However, not all rows are converted, as shown below.

scala> newdf.show(4, false)
+--------+------------------+------------+------------------------------------------------------------+
|userId  |legacyProductId   |OptInFlag   |unix_timestamp(transaction_date, EEE MMM dd hh:mm:ss z yyyy)|
+--------+------------------+------------+------------------------------------------------------------+
|71844441|805934            |null        |null                                                        |
|71844441|714837            |null        |1113041070                                                  |
|71844441|732860            |null        |null                                                        |
|71844441|1170951           |null        |null                                                        |
+--------+------------------+------------+------------------------------------------------------------+
only showing top 4 rows

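Only the timestamp in the second row was converted successfully. The rest failed and were set to null.

Am I specifying the format string EEE MMM d hh:mm:ss z yyyy correctly? How can I debug this?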

Answer 1

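That's because hh is

Hour in am/pm (1-12)

so hours such as 17 and 23 fall outside its range and the result is null; only the 10:04:30 row parses. You should use HH:

Hour in day (0-23)

For example: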

scala> spark.sql("SELECT unix_timestamp('Sat Mar 19 17:30:26 UTC 2005', 'EEE MMM dd HH:mm:ss zzz yyyy')").show
// +--------------------------------------------------------------------------+
// |unix_timestamp(Sat Mar 19 17:30:26 UTC 2005, EEE MMM dd HH:mm:ss zzz yyyy)|
// +--------------------------------------------------------------------------+
// |                                                                1111253426|
// +--------------------------------------------------------------------------+
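
To debug a pattern outside Spark, one option is to try it against a few sample values on the plain JVM. The sketch below assumes that Spark 2.2's unix_timestamp follows java.text.SimpleDateFormat pattern semantics with strict (non-lenient) parsing; the names PatternCheck and toEpochSeconds are made up for illustration and are not part of Spark.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}
import scala.util.Try

// Hypothetical helper (not a Spark API) for checking a date pattern
// against sample strings before using it in unix_timestamp.
object PatternCheck {
  // Returns epoch seconds, or None when the value does not match the pattern.
  def toEpochSeconds(value: String, pattern: String): Option[Long] = {
    val fmt = new SimpleDateFormat(pattern, Locale.US)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    // Assumption: strict parsing, so out-of-range fields are rejected
    // instead of being rolled over.
    fmt.setLenient(false)
    Try(fmt.parse(value).getTime / 1000L).toOption
  }

  // Sample values taken from the question's transaction_date column.
  val samples: Seq[String] = Seq(
    "Sat Oct 15 23:35:22 UTC 2005",
    "Sat Apr 09 10:04:30 UTC 2005",
    "Sat Mar 19 17:30:26 UTC 2005"
  )

  def main(args: Array[String]): Unit = {
    for (s <- samples) {
      val h12 = toEpochSeconds(s, "EEE MMM dd hh:mm:ss z yyyy") // 12-hour clock
      val h24 = toEpochSeconds(s, "EEE MMM dd HH:mm:ss z yyyy") // 24-hour clock
      println(s"$s  hh=$h12  HH=$h24")
    }
  }
}
```

Under that assumption, the hh pattern yields a value only for the 10:04:30 sample, while the HH pattern yields a value for all three. In the DataFrame code from the question, the same change means passing "EEE MMM dd HH:mm:ss z yyyy" to unix_timestamp.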
