Finding missing timestamps within groups

Time: 2020-08-26 02:00:20

Tags: pyspark

I have a dataframe like the following:

       Timestamp        SiteID   Count
2020-01-02T05:33:05      1044     5949
2020-01-02T05:50:05      1044     177
2020-01-02T06:00:36      1020     587
2020-01-02T06:01:05      1020     367

I need to generate the missing timestamps at one-minute intervals, grouped by SiteID. The Count for the generated timestamps can be 0.

Thanks

1 Answer:

Answer 0: (score: 0)

Here is my attempt.

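from pyspark.sql.functions import (
    array_distinct, array_max, array_min, array_union, collect_list,
    explode, from_unixtime, lit, sequence, unix_timestamp,
)

(
    df.groupBy('SiteID')
    .agg(collect_list(unix_timestamp('Timestamp')).alias('Timestamp'))
    # Step 60 s from the earliest to the latest epoch second of each site;
    # min/max also cover groups with more than two rows or unordered lists.
    .withColumn('seq', sequence(array_min('Timestamp'), array_max('Timestamp'), lit(60)))
    .withColumn('seq', explode(array_distinct(array_union('seq', 'Timestamp'))))
    .withColumn('Timestamp', from_unixtime('seq'))
    .drop('seq')
    # The left join restores known counts; generated rows are filled with 0.
    .join(df, ['SiteID', 'Timestamp'], 'left')
    .fillna(0).show(20, False)
)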

+------+-------------------+-----+
|SiteID|Timestamp          |Count|
+------+-------------------+-----+
|1020  |2020-01-02 06:00:36|587  |
|1020  |2020-01-02 06:01:05|367  |
|1044  |2020-01-02 05:33:05|5949 |
|1044  |2020-01-02 05:34:05|0    |
|1044  |2020-01-02 05:35:05|0    |
|1044  |2020-01-02 05:36:05|0    |
|1044  |2020-01-02 05:37:05|0    |
|1044  |2020-01-02 05:38:05|0    |
|1044  |2020-01-02 05:39:05|0    |
|1044  |2020-01-02 05:40:05|0    |
|1044  |2020-01-02 05:41:05|0    |
|1044  |2020-01-02 05:42:05|0    |
|1044  |2020-01-02 05:43:05|0    |
|1044  |2020-01-02 05:44:05|0    |
|1044  |2020-01-02 05:45:05|0    |
|1044  |2020-01-02 05:46:05|0    |
|1044  |2020-01-02 05:47:05|0    |
|1044  |2020-01-02 05:48:05|0    |
|1044  |2020-01-02 05:49:05|0    |
|1044  |2020-01-02 05:50:05|177  |
+------+-------------------+-----+
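
To sanity-check a result like the one above on a small sample, the same idea can be sketched in plain Python without Spark. This is only an illustration under assumptions: the helper name `fill_missing_minutes` and the in-memory `rows` list are hypothetical and not part of the original code. For each SiteID it steps 60 seconds at a time from the first to the last observed timestamp, merges in the observed timestamps, and uses 0 where no Count exists.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def fill_missing_minutes(rows, step_seconds=60):
    """Return (site_id, timestamp, count) tuples with gaps filled by count 0.

    rows: iterable of (timestamp_str, site_id, count) tuples (hypothetical input shape).
    """
    observed = defaultdict(dict)  # site_id -> {timestamp: count}
    for ts, site_id, count in rows:
        observed[site_id][datetime.fromisoformat(ts)] = count

    step = timedelta(seconds=step_seconds)
    result = []
    for site_id in sorted(observed):
        counts = observed[site_id]
        start, stop = min(counts), max(counts)
        # Fixed-step sequence from the first to the last observed timestamp.
        generated = set()
        current = start
        while current <= stop:
            generated.add(current)
            current += step
        # Union with the observed timestamps so that none of them is lost.
        for ts in sorted(generated | set(counts)):
            result.append((site_id, ts.strftime("%Y-%m-%d %H:%M:%S"), counts.get(ts, 0)))
    return result


# Sample data copied from the question.
rows = [
    ("2020-01-02T05:33:05", 1044, 5949),
    ("2020-01-02T05:50:05", 1044, 177),
    ("2020-01-02T06:00:36", 1020, 587),
    ("2020-01-02T06:01:05", 1020, 367),
]

filled = fill_missing_minutes(rows)
for row in filled:
    print(row)
```

Note that the steps are counted from each site's first observed timestamp, not from clock-minute boundaries. That is why site 1020, whose two rows are only 29 seconds apart, gets no generated rows.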