I have a dataframe in PySpark. In this dataframe I have a column of timestamp data type. Now I want to add 2 extra hours to every row of the timestamp column, without creating any new column.
For example, here is the sample data:
df
id testing_time test_name
1 2017-03-12 03:19:58 Raising
2 2017-03-12 03:21:30 sleeping
3 2017-03-12 03:29:40 walking
4 2017-03-12 03:31:23 talking
5 2017-03-12 04:19:47 eating
6 2017-03-12 04:33:51 working
I want to get something like below.
df1
id testing_time test_name
1 2017-03-12 05:19:58 Raising
2 2017-03-12 05:21:30 sleeping
3 2017-03-12 05:29:40 walking
4 2017-03-12 05:31:23 talking
5 2017-03-12 06:19:47 eating
6 2017-03-12 06:33:51 working
How can I do that?
Answer 0 (score: 9)
You can use the unix_timestamp function to convert the testing_time column to bigint in seconds, add 2 hours (7200 seconds), and then cast the result back to timestamp:
import pyspark.sql.functions as F
df.withColumn("testing_time", (F.unix_timestamp("testing_time") + 7200).cast('timestamp')).show()
+---+-------------------+---------+
| id| testing_time|test_name|
+---+-------------------+---------+
| 1|2017-03-12 05:19:58| Raising|
| 2|2017-03-12 05:21:30| sleeping|
| 3|2017-03-12 05:29:40| walking|
| 4|2017-03-12 05:31:23| talking|
| 5|2017-03-12 06:19:47| eating|
| 6|2017-03-12 06:33:51| working|
+---+-------------------+---------+
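Note that withColumn with an existing column name replaces that column in place, so no new column is created; to keep the change you just assign the result back to the dataframe. A minimal sketch of that, reusing the same logic and column names as above:

import pyspark.sql.functions as F

# Overwrite the existing testing_time column with the shifted values (no new column is added)
df = df.withColumn(
    "testing_time",
    (F.unix_timestamp("testing_time") + 7200).cast("timestamp")  # +2 hours, expressed in seconds
)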
Answer 1 (score: 7)
An approach that does not require explicit casting, using a Spark interval literal (arguably more readable):
df = df.withColumn('testing_time', df.testing_time + F.expr('INTERVAL 2 HOURS'))
df.show()
+---+-------------------+---------+
| id| testing_time|test_name|
+---+-------------------+---------+
| 1|2017-03-12 05:19:58| Raising|
| 2|2017-03-12 05:21:30| sleeping|
| 3|2017-03-12 05:29:40| walking|
| 4|2017-03-12 05:31:23| talking|
| 5|2017-03-12 06:19:47| eating|
| 6|2017-03-12 06:33:51| working|
+---+-------------------+---------+
Or, as a complete example:
import pyspark.sql.functions as F
from datetime import datetime
data = [
(1, datetime(2017, 3, 12, 3, 19, 58), 'Raising'),
(2, datetime(2017, 3, 12, 3, 21, 30), 'sleeping'),
(3, datetime(2017, 3, 12, 3, 29, 40), 'walking'),
(4, datetime(2017, 3, 12, 3, 31, 23), 'talking'),
(5, datetime(2017, 3, 12, 4, 19, 47), 'eating'),
(6, datetime(2017, 3, 12, 4, 33, 51), 'working'),
]
df = sqlContext.createDataFrame(data, ['id', 'testing_time', 'test_name'])
df = df.withColumn('testing_time', df.testing_time + F.expr('INTERVAL 2 HOURS'))
df.show()
+---+-------------------+---------+
| id| testing_time|test_name|
+---+-------------------+---------+
| 1|2017-03-12 05:19:58| Raising|
| 2|2017-03-12 05:21:30| sleeping|
| 3|2017-03-12 05:29:40| walking|
| 4|2017-03-12 05:31:23| talking|
| 5|2017-03-12 06:19:47| eating|
| 6|2017-03-12 06:33:51| working|
+---+-------------------+---------+
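If the offset needs to vary, the interval literal can be built from a Python variable. A small sketch, assuming a hypothetical hours parameter of your own:

import pyspark.sql.functions as F

hours = 2  # hypothetical parameter; any integer number of hours works
df = df.withColumn('testing_time', df.testing_time + F.expr(f'INTERVAL {hours} HOURS'))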
Answer 2 (score: 0)
Based on @Psidom's answer: since in my case the column testing_base has quite variable time formatting, instead of using F.unix_timestamp("testing_time", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), which works for records with milliseconds but returns null for records with seconds-level granularity, I did it like this:
import pyspark.sql.functions as F
df.withColumn("testing_time",
(F.unix_timestamp(F.col("testing_time").cast("timestamp")) + 7200).cast('timestamp'))
This way, whatever the time format of the testing_time field is, it is handled by the cast function that PySpark provides.
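As an illustration of the point, here is a small sketch (the sample strings and dataframe are made up, not from the question) showing string values with and without milliseconds going through the same cast-then-shift expression:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical mixed-format input: one value with milliseconds, one with seconds only
mixed = spark.createDataFrame(
    [("2017-03-12T03:19:58.123Z",), ("2017-03-12 03:21:30",)],
    ["testing_time"],
)

# Cast the string to timestamp first, then add 2 hours (7200 seconds) and cast back
mixed.withColumn(
    "testing_time",
    (F.unix_timestamp(F.col("testing_time").cast("timestamp")) + 7200).cast("timestamp"),
).show(truncate=False)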