Divide time into 30-minute periods

Time: 2018-03-11 06:32:19

Tags: datetime apache-spark time pyspark-sql

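I have a DataFrame that contains a "time" column. I want to divide the day into 30-minute periods and add a new column containing the number of the period each timestamp falls into. For example, the original DataFrame: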

l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test|               time|
+----+-------------------+
|   A|2017-01-13 00:30:00|
|   A|2017-01-13 00:00:01|
|   E|2017-01-13 14:00:00|
|   E|2017-01-13 12:08:15|
+----+-------------------+

The desired DataFrame is as follows:

+----+-------------------+------+
|test|               time|period|
+----+-------------------+------+
|   A|2017-01-13 00:30:00|     2|
|   A|2017-01-13 00:00:01|     1|
|   E|2017-01-13 14:00:00|    29|
|   E|2017-01-13 12:08:15|    25|
+----+-------------------+------+

Is there a way to achieve this?

1 answer:

Answer 0: (score: 1)

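You can use the hour and minute built-in functions together with the when built-in function to get the final result. Each hour contains two 30-minute periods, so the period number is hour * 2 + 1, plus 1 more when the minute is 30 or later.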

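from pyspark.sql import functions as F

# Each hour holds two 30-minute periods, and periods are numbered from 1.
# Add 1 more when the timestamp falls in the second half of the hour.
period = (F.hour(df1['time']) * 2) + 1 + F.when(F.minute(df1['time']) >= 30, 1).otherwise(0)
df1.withColumn('period', period).show(truncate=False)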

You should get

+----+---------------------+------+
|test|time                 |period|
+----+---------------------+------+
|A   |2017-01-13 00:30:00.0|2     |
|A   |2017-01-13 00:00:01.0|1     |
|E   |2017-01-13 14:00:00.0|29    |
|E   |2017-01-13 12:08:15.0|25    |
+----+---------------------+------+
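
To see why the expression produces these period numbers, the same arithmetic can be checked in plain Python without Spark. This is only a sketch: the helper name half_hour_period is hypothetical and not part of PySpark, and it assumes periods are numbered 1 to 48 starting at midnight, as in the desired output above.

```python
from datetime import datetime


def half_hour_period(ts: datetime) -> int:
    """Return the 1-based 30-minute period of the day (hypothetical helper)."""
    # Two periods per hour, numbered from 1; add one more for the second half-hour.
    return ts.hour * 2 + 1 + (1 if ts.minute >= 30 else 0)


samples = [
    "2017-01-13 00:30:00",
    "2017-01-13 00:00:01",
    "2017-01-13 14:00:00",
    "2017-01-13 12:08:15",
]

for s in samples:
    ts = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    print(s, half_hour_period(ts))
# 2017-01-13 00:30:00 2
# 2017-01-13 00:00:01 1
# 2017-01-13 14:00:00 29
# 2017-01-13 12:08:15 25
```

With this numbering, 00:00:00 maps to period 1 and 23:59:59 maps to period 48.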

I hope the answer is helpful.