Getting the average date value from a PySpark DataFrame

Time: 2020-10-14 18:48:03

Tags: apache-spark pyspark apache-spark-sql

I have a df containing product data with the following schema:

root
 |-- Creator: string (nullable = true)
 |-- Created_datetime: timestamp (nullable = true)
 |-- Last_modified_datetime: timestamp (nullable = true)
 |-- Product_name: string (nullable = true)

The Created_datetime column looks like this:

+-------------------+
|   Created_datetime|
+-------------------+
|2019-10-12 17:09:18|
|2019-12-03 07:02:07|
|2020-01-16 23:10:08|

Now I want to extract the average value of the Created_datetime column (or the existing value that is closest to the average). How can I do this?

1 answer:

Answer 0: (score: 1)

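When you compute the average of a timestamp column, the result is the average of the underlying Unix timestamps (seconds since the epoch) as a number, not a timestamp. Cast it back to timestamp: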

from pyspark.sql import functions as F

# Average the column, then cast the numeric result back to a timestamp.
df.agg(F.avg("Created_datetime").cast("timestamp").alias("avg_created_datetime")).show()
+--------------------+
|avg_created_datetime|
+--------------------+
| 2019-11-30 23:27:11|
+--------------------+
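
The question also asks for the existing value closest to the average, which the snippet above does not cover. The following is a minimal plain-Python sketch of the arithmetic involved, not Spark code: it converts each timestamp to seconds since the epoch, averages the numbers, converts the mean back, and then picks the nearest existing value. It uses only the three rows visible in the question, so its result need not match the output above. The names `created`, `EPOCH`, `average` and `closest` are illustrative assumptions, and the values are treated as time-zone-naive.

```python
from datetime import datetime, timedelta

# Sample values copied from the Created_datetime column in the question.
created = [
    datetime(2019, 10, 12, 17, 9, 18),
    datetime(2019, 12, 3, 7, 2, 7),
    datetime(2020, 1, 16, 23, 10, 8),
]

EPOCH = datetime(1970, 1, 1)  # naive epoch, so no time zone is involved

# Step 1: turn each timestamp into seconds since the epoch.
seconds = [(ts - EPOCH).total_seconds() for ts in created]

# Step 2: average the numbers, then convert the mean back to a timestamp.
mean_seconds = sum(seconds) / len(seconds)
average = EPOCH + timedelta(seconds=mean_seconds)

# Step 3: pick the existing value with the smallest distance to the mean.
closest = min(created, key=lambda ts: abs(ts - average))

print(average)
print(closest)
```

On a real DataFrame the same idea would presumably carry over by computing the average first and then ordering rows by their absolute distance to it; that variant is not shown here because it depends on the Spark version and session time zone.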