I have a DataFrame like the following:
Timestamp SiteID Count
2020-01-02T05:33:05 1044 5949
2020-01-02T05:50:05 1044 177
2020-01-02T06:00:36 1020 587
2020-01-02T06:01:05 1020 367
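I need to generate the missing per-minute timestamps for each SiteID. The generated timestamps can have a Count of 0.
Thanks.
Answer 0 (score: 0)
Here is my attempt. It builds a 60-second sequence between each site's timestamps, unions it with the observed timestamps, and left-joins the original counts, filling the gaps with 0.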
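from pyspark.sql.functions import *

# Collect each site's timestamps as epoch seconds, then build a 60-second
# sequence from the earliest to the latest one. array_min/array_max are used
# because collect_list does not guarantee element order.
# The union keeps observed timestamps that fall off the 60-second grid.
df.groupBy('SiteID').agg(collect_list(unix_timestamp('Timestamp')).alias('Timestamp')) \
  .withColumn('seq', sequence(array_min('Timestamp'), array_max('Timestamp'), lit(60))) \
  .withColumn('seq', explode(array_distinct(array_union('seq', 'Timestamp')))) \
  .withColumn('Timestamp', from_unixtime('seq')) \
  .drop('seq') \
  .join(df, ['SiteID', 'Timestamp'], 'left') \
  .fillna(0).show(20, False)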
+------+-------------------+-----+
|SiteID|Timestamp |Count|
+------+-------------------+-----+
|1020 |2020-01-02 06:00:36|587 |
|1020 |2020-01-02 06:01:05|367 |
|1044 |2020-01-02 05:33:05|5949 |
|1044 |2020-01-02 05:34:05|0 |
|1044 |2020-01-02 05:35:05|0 |
|1044 |2020-01-02 05:36:05|0 |
|1044 |2020-01-02 05:37:05|0 |
|1044 |2020-01-02 05:38:05|0 |
|1044 |2020-01-02 05:39:05|0 |
|1044 |2020-01-02 05:40:05|0 |
|1044 |2020-01-02 05:41:05|0 |
|1044 |2020-01-02 05:42:05|0 |
|1044 |2020-01-02 05:43:05|0 |
|1044 |2020-01-02 05:44:05|0 |
|1044 |2020-01-02 05:45:05|0 |
|1044 |2020-01-02 05:46:05|0 |
|1044 |2020-01-02 05:47:05|0 |
|1044 |2020-01-02 05:48:05|0 |
|1044 |2020-01-02 05:49:05|0 |
|1044 |2020-01-02 05:50:05|177 |
+------+-------------------+-----+
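To make the gap-filling logic easier to check outside Spark, here is a minimal plain-Python sketch of the same idea, using only the standard library and the four sample rows from the question. The function name `fill_missing_minutes` and its `step_seconds` parameter are made up for illustration and are not part of PySpark. It assumes, like the attempt above, that the per-minute grid starts at each site's earliest timestamp rather than at whole-minute boundaries.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sample rows copied from the question: (Timestamp, SiteID, Count)
rows = [
    ("2020-01-02T05:33:05", 1044, 5949),
    ("2020-01-02T05:50:05", 1044, 177),
    ("2020-01-02T06:00:36", 1020, 587),
    ("2020-01-02T06:01:05", 1020, 367),
]


def fill_missing_minutes(rows, step_seconds=60):
    """Return (site_id, timestamp, count) tuples with zero-count rows added.

    Hypothetical helper: for each site, timestamps are generated every
    `step_seconds` starting at the earliest observation and stopping at or
    before the latest one. Observed timestamps are always kept, even when
    they do not fall on the generated grid.
    """
    # Group the observed counts by site, keyed by parsed timestamp.
    observed = defaultdict(dict)
    for ts, site, count in rows:
        observed[site][datetime.fromisoformat(ts)] = count

    step = timedelta(seconds=step_seconds)
    result = []
    for site in sorted(observed):
        counts = observed[site]
        start, end = min(counts), max(counts)
        # Start from the observed timestamps, then add the generated grid.
        grid = set(counts)
        current = start
        while current <= end:
            grid.add(current)
            current += step
        # Generated timestamps with no observation get a count of 0.
        for ts in sorted(grid):
            result.append((site, ts, counts.get(ts, 0)))
    return result


filled = fill_missing_minutes(rows)
for site, ts, count in filled:
    print(site, ts.strftime("%Y-%m-%d %H:%M:%S"), count)
```

On the sample data this yields the same 20 rows as the table above: two for site 1020 (its observations are less than 60 seconds apart, so nothing is generated between them) and eighteen for site 1044.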