PySpark: collect values into a list within a specific time range

Asked: 2019-07-15 16:12:30

Tags: python python-3.x apache-spark pyspark

I have a PySpark dataframe with an id column, a timestamp column (tts) and a value column. I am trying to build a dataframe that first groups rows with the same id, then splits apart rows that are more than two weeks from each other, and finally collects their value into a list.

I have already tried the rangeBetween() window function, but it does not do exactly what I need. I think the following code illustrates my problem better:

My dataframe sdf:

+---+-------------------------+-----+
|id |tts                      |value|
+---+-------------------------+-----+
|0  |2019-01-01T00:00:00+00:00|a    |
|0  |2019-01-02T00:00:00+00:00|b    |
|0  |2019-01-20T00:00:00+00:00|c    |
|0  |2019-01-25T00:00:00+00:00|d    |
|1  |2019-01-02T00:00:00+00:00|a    |
|1  |2019-01-29T00:00:00+00:00|b    |
|2  |2019-01-01T00:00:00+00:00|a    |
|2  |2019-01-30T00:00:00+00:00|b    |
|2  |2019-02-02T00:00:00+00:00|c    |
+---+-------------------------+-----+
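
A minimal way to recreate this sample dataframe, assuming a SparkSession is already available as spark (that variable name is an assumption):

rows = [
    (0, '2019-01-01T00:00:00+00:00', 'a'),
    (0, '2019-01-02T00:00:00+00:00', 'b'),
    (0, '2019-01-20T00:00:00+00:00', 'c'),
    (0, '2019-01-25T00:00:00+00:00', 'd'),
    (1, '2019-01-02T00:00:00+00:00', 'a'),
    (1, '2019-01-29T00:00:00+00:00', 'b'),
    (2, '2019-01-01T00:00:00+00:00', 'a'),
    (2, '2019-01-30T00:00:00+00:00', 'b'),
    (2, '2019-02-02T00:00:00+00:00', 'c'),
]
sdf = spark.createDataFrame(rows, ['id', 'tts', 'value'])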

My approach:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

DAY_SECS = 3600 * 24  # seconds in one day

# 14-day lookback window per id, ordered by the timestamp cast to epoch seconds
w_spec = Window \
         .partitionBy('id') \
         .orderBy(F.col('tts').cast('timestamp').cast('long')) \
         .rangeBetween(Window.currentRow - 14 * DAY_SECS, Window.currentRow)

out = sdf \
        .withColumn('val_seq', F.collect_list('value').over(w_spec))

Output:

+---+-------------------------+-----+-------+
|id |tts                      |value|val_seq|
+---+-------------------------+-----+-------+
|0  |2019-01-01T00:00:00+00:00|a    |[a]    |
|0  |2019-01-02T00:00:00+00:00|b    |[a, b] |
|0  |2019-01-20T00:00:00+00:00|c    |[c]    |
|0  |2019-01-25T00:00:00+00:00|d    |[c, d] |
|1  |2019-01-02T00:00:00+00:00|a    |[a]    |
|1  |2019-01-29T00:00:00+00:00|b    |[b]    |
|2  |2019-01-01T00:00:00+00:00|a    |[a]    |
|2  |2019-01-30T00:00:00+00:00|b    |[b]    |
|2  |2019-02-02T00:00:00+00:00|c    |[b, c] |
+---+-------------------------+-----+-------+

My desired output:

+---+-------------------------+---------+
|id |tts                      |val_seq  |
+---+-------------------------+---------+
|0  |2019-01-02T00:00:00+00:00|[a, b]   |
|0  |2019-01-25T00:00:00+00:00|[c, d]   |
|1  |2019-01-02T00:00:00+00:00|[a]      |
|1  |2019-01-29T00:00:00+00:00|[b]      |
|2  |2019-01-30T00:00:00+00:00|[a]      |
|2  |2019-02-02T00:00:00+00:00|[b, c]   |
+---+-------------------------+---------+

To summarize: I want to group the rows in sdf that share the same id, concatenate the value of rows that are no more than two weeks apart into a list, and finally show only those grouped rows.

I am really new to PySpark, so any suggestions are appreciated!

1 Answer:

Answer 0 (score: 0):

The following code should work:

w_spec = Window \
         .partitionBy('id') \
         .orderBy(F.col('tts').cast('timestamp').cast('long')) \
         .rangeBetween(Window.currentRow - 14 * DAY_SECS, Window.currentRow)
# Rank rows within each id by how many rows fall inside their 14-day window,
# largest count first, so the filter keeps the row(s) that cover the most values.
w_spec2 = Window.partitionBy('id').orderBy(F.col('occurrences_in_5_min').desc())
out = sdf \
        .withColumn('val_seq', F.collect_list('value').over(w_spec)) \
        .withColumn('occurrences_in_5_min', F.count('tts').over(w_spec)) \
        .withColumn('rank', F.rank().over(w_spec2)) \
        .filter('rank == 1')
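
For reference, a sketch of a gap-based alternative, assuming the intent is to start a new group whenever consecutive rows of the same id are more than 14 days apart; the column and variable names gap, new_grp, grp and grouped are illustrative, not from the question, and it reuses DAY_SECS and the imports from above:

# epoch seconds of the timestamp, used for gap arithmetic
ts = F.col('tts').cast('timestamp').cast('long')
w_order = Window.partitionBy('id').orderBy(ts)

grouped = (
    sdf
    # gap (in seconds) to the previous row of the same id; null for the first row
    .withColumn('gap', ts - F.lag(ts, 1).over(w_order))
    # flag rows that start a new group (more than 14 days after the previous row)
    .withColumn('new_grp', F.when(F.col('gap') > 14 * DAY_SECS, 1).otherwise(0))
    # running sum of the flags gives a per-id group number
    .withColumn('grp', F.sum('new_grp').over(w_order))
    # one row per group: the latest timestamp plus the collected values
    # (note: collect_list after groupBy does not guarantee element order)
    .groupBy('id', 'grp')
    .agg(F.max('tts').alias('tts'), F.collect_list('value').alias('val_seq'))
    .drop('grp')
    .orderBy('id', 'tts')
)
grouped.show(truncate=False)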