When a Spark DataFrame calls add_months() on a timestamp-typed column, it returns a date type. How do I keep the hour:minute:seconds?

Asked: 2016-08-19 02:43:22

Tags: apache-spark apache-spark-sql spark-dataframe

I am very new to Spark. When I use add_months() on a column of timestamp type, it returns a date type. How can I keep the hour:minute:seconds part?

df.where($"DEAL_ID" === "deal1" && $"POOL_ID" ==="pool_1")
  .select("LVALID_DEAL_DATE","LAST_PROCESS_DATE")
  .withColumn("test", add_months($"LAST_PROCESS_DATE", -3))
  .show

Output:

|    LVALID_DEAL_DATE|   LAST_PROCESS_DATE|      test|
|2016-05-01 00:00:...|2016-08-01 19:38:...|2016-05-01|

2 Answers:

Answer 0 (score: 0)

It looks like add_months only supports the Date type; if a Timestamp is passed in, only the Date part is returned. I tried the code below with the unix_timestamp function, but it just resets HH:mm:ss to 00:00:00.

 df.withColumn("New Dates",unix_timestamp(add_months(df("Dates"),1)).cast("timestamp")).show

Answer 1 (score: 0)

The trick here is to first extract the time-of-day component of the timestamp in milliseconds (by round-tripping through its unix epoch representation), then add it back onto the new date produced by add_months. For example, the epoch 1573362092000 ms truncated to its date is 1573344000000 ms, leaving a time component of 18092000 ms, i.e. 05:01:32.

Suppose we have a DataFrame df as shown below:

df.show()
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val dfttimestamp = df.withColumn("StartDateTimeEpoch", lit(1573362092000L))
  .withColumn("StartDateTimeStamp", to_utc_timestamp(to_timestamp(col("StartDateTimeEpoch") / 1000), "UTC"))
  .withColumn("StartDateTimeTruncated", unix_timestamp(col("StartDateTimeStamp").cast(DateType)) * 1000) // truncate the time component by casting to Date
  .withColumn("StartTimeMillisDiff", col("StartDateTimeEpoch") - col("StartDateTimeTruncated")) // time-of-day component in millis
  .withColumn("StartDate_NextYr", add_months(col("StartDateTimeStamp"), 12)) // add 12 months to get next year, as a Date column
  .withColumn("StartDateTimeEpoch_NextYr", unix_timestamp(col("StartDate_NextYr")) * 1000 + col("StartTimeMillisDiff")) // convert the Date back to a unix timestamp and add the previously computed diff in millis
  .withColumn("StartDateTimeStamp_NextYr", to_utc_timestamp(to_timestamp(col("StartDateTimeEpoch_NextYr") / 1000), "UTC"))

dfttimestamp.show()
dfttimestamp.printSchema()

+---+------------------+-------------------+----------------------+-------------------+----------------+-------------------------+-------------------------+
| id|StartDateTimeEpoch| StartDateTimeStamp|StartDateTimeTruncated|StartTimeMillisDiff|StartDate_NextYr|StartDateTimeEpoch_NextYr|StartDateTimeStamp_NextYr|
+---+------------------+-------------------+----------------------+-------------------+----------------+-------------------------+-------------------------+
|  1|     1573362092000|2019-11-10 05:01:32|         1573344000000|           18092000|      2020-11-10|            1604984492000|      2020-11-10 05:01:32|
|  2|     1573362092000|2019-11-10 05:01:32|         1573344000000|           18092000|      2020-11-10|            1604984492000|      2020-11-10 05:01:32|
|  3|     1573362092000|2019-11-10 05:01:32|         1573344000000|           18092000|      2020-11-10|            1604984492000|      2020-11-10 05:01:32|
|  4|     1573362092000|2019-11-10 05:01:32|         1573344000000|           18092000|      2020-11-10|            1604984492000|      2020-11-10 05:01:32|
|  5|     1573362092000|2019-11-10 05:01:32|         1573344000000|           18092000|      2020-11-10|            1604984492000|      2020-11-10 05:01:32|
+---+------------------+-------------------+----------------------+-------------------+----------------+-------------------------+-------------------------+

root
 |-- id: integer (nullable = false)
 |-- StartDateTimeEpoch: long (nullable = false)
 |-- StartDateTimeStamp: timestamp (nullable = true)
 |-- StartDateTimeTruncated: long (nullable = true)
 |-- StartTimeMillisDiff: long (nullable = true)
 |-- StartDate_NextYr: date (nullable = true)
 |-- StartDateTimeEpoch_NextYr: long (nullable = true)
 |-- StartDateTimeStamp_NextYr: timestamp (nullable = true)
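
The same trick condenses into a small helper that works in whole seconds (a minimal sketch, assuming the Spark 2.x column functions; addMonthsKeepTime is a hypothetical name, not a Spark built-in):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Hypothetical helper: shift a timestamp by numMonths while keeping HH:mm:ss.
// add_months() returns a Date, so we measure the original time-of-day in
// seconds since midnight and add it back on top of the shifted date.
def addMonthsKeepTime(ts: Column, numMonths: Int): Column = {
  val secondsIntoDay = unix_timestamp(ts) - unix_timestamp(ts.cast("date"))
  (unix_timestamp(add_months(ts, numMonths)) + secondsIntoDay).cast("timestamp")
}

// Usage against the question's DataFrame:
// df.withColumn("test", addMonthsKeepTime($"LAST_PROCESS_DATE", -3))

Note that, like the milliseconds version above, this re-applies the time-of-day as a fixed offset from midnight in the session time zone, so results can shift by an hour across daylight-saving boundaries.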