How do I truncate a timestamp column in a PySpark DataFrame to the day?

Asked: 2018-04-20 18:48:16

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I have a PySpark DataFrame with timestamps in a column (call the column 'dt'), like this:

2018-04-07 16:46:00
2018-03-06 22:18:00

When I run:

SELECT trunc(dt, 'day') as day

... I expect:

2018-04-07 00:00:00
2018-03-06 00:00:00

But I get:

null
null

How can I truncate to the day rather than the hour?

3 answers:

Answer 0 (score: 8)

You are using the wrong function. trunc supports only a few formats:

> Returns date truncated to the unit specified by the format.
>
> :param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'

Use date_trunc instead:

> Returns timestamp truncated to the unit specified by the format.
>
> :param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'

Example:

from pyspark.sql.functions import col, date_trunc

df = spark.createDataFrame(["2018-04-07 23:33:21"], "string").toDF("dt").select(col("dt").cast("timestamp"))

df.select(date_trunc("day", "dt")).show()
# +-------------------+                                                           
# |date_trunc(day, dt)|
# +-------------------+
# |2018-04-07 00:00:00|
# +-------------------+
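What date_trunc("day", ...) computes is simply the same timestamp with the time-of-day fields zeroed out. A minimal plain-Python sketch of the semantics (no Spark required; the helper name truncate_to_day is made up for illustration):

```python
from datetime import datetime

def truncate_to_day(ts: datetime) -> datetime:
    # Zero out the time-of-day fields, keeping the date part
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

print(truncate_to_day(datetime(2018, 4, 7, 16, 46, 0)))
# 2018-04-07 00:00:00
```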

Answer 1 (score: 0)

A simple way to do this with string manipulation:

from pyspark.sql.functions import lit, concat

# substr is 1-indexed in Spark; take the first 10 characters (yyyy-MM-dd).
# Note: the result is a string column, not a timestamp.
df = df.withColumn('date', concat(df.dt.substr(1, 10), lit(' 00:00:00')))
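The idea is just to keep the first 10 characters (the yyyy-MM-dd part) and append midnight. In plain Python the same operation reads:

```python
ts = "2018-04-07 16:46:00"
day = ts[:10] + " 00:00:00"  # keep the date part, append midnight
print(day)
# 2018-04-07 00:00:00
```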

Answer 2 (score: 0)

For Spark <= 2.2.0

use this:

from pyspark.sql.functions import to_date, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame([['2020-10-03 05:00:00']], schema=['timestamp']) \
    .withColumn('timestamp', col('timestamp').astype(TimestampType())) \
    .withColumn('date', to_date('timestamp').astype(TimestampType())) \
    .show(truncate=False)

+-------------------+-------------------+
|timestamp          |date               |
+-------------------+-------------------+
|2020-10-03 05:00:00|2020-10-03 00:00:00|
+-------------------+-------------------+
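The to_date-then-cast trick amounts to combining the date part with midnight; a plain-Python sketch of what that round trip does:

```python
from datetime import datetime, time

ts = datetime(2020, 10, 3, 5, 0, 0)
# to_date drops the time of day; casting back to timestamp restores midnight
midnight = datetime.combine(ts.date(), time.min)
print(midnight)
# 2020-10-03 00:00:00
```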

For Spark > 2.2.0 (see datetime patterns in spark 3.0.0):

from pyspark.sql.functions import date_trunc, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame([['2020-10-03 05:00:00']], schema=['timestamp']) \
    .withColumn('timestamp', col('timestamp').astype(TimestampType())) \
    .withColumn('date', date_trunc(timestamp='timestamp', format='day')) \
    .show(truncate=False)

+-------------------+-------------------+
|timestamp          |date               |
+-------------------+-------------------+
|2020-10-03 05:00:00|2020-10-03 00:00:00|
+-------------------+-------------------+