PySpark: sum of the last value per id in a time-series window

Time: 2018-11-15 14:42:37

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

I have this DataFrame in PySpark:

[Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 6095), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 15215), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 25456), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 35641), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 44516), timestamp=1532354662),
 Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 106098), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 108248), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 118453), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 129638), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 138515), timestamp=1532354662),
 Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 206095), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 215213), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 225445), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 234635), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 244514), timestamp=1532354662),
 Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 306095), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 309226), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 319454), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 329651), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 337523), timestamp=1532354662),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 406077), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 415209), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 425481), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 435638), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 445548), timestamp=1532354662),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 506073), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 508245), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 519452), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 529641), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 537512), timestamp=1532354662),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 606087), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 615193), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 625452), timestamp=1532354662),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 635632), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 645538), timestamp=1532354662),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 706073), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 709212), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 718452), timestamp=1532354662),
 Row(id='48D', value=35275, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 729642), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 738524), timestamp=1532354662),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 806095), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 815210), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 825455), timestamp=1532354662),
 Row(id='48D', value=35275, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 834640), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 844520), timestamp=1532354662),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 906083), timestamp=1532354662),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 908243), timestamp=1532354662),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 918445), timestamp=1532354662),
 Row(id='48D', value=35275, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 929632), timestamp=1532354662),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 938511), timestamp=1532354662)]

I need to sum the last known value of each id within every one-second window.

The output should look like this:

 [Row(time=datetime.datetime(2018, 7, 23, 14, 4, 22), sum=176213),
 Row(time=datetime.datetime(2018, 7, 23, 14, 4, 23), sum=176112),
 Row(time=datetime.datetime(2018, 7, 23, 14, 4, 24), sum=175933),
 Row(time=datetime.datetime(2018, 7, 23, 14, 4, 25), sum=175543),
 Row(time=datetime.datetime(2018, 7, 23, 14, 4, 26), sum=175219),
 Row(time=datetime.datetime(2018, 7, 23, 14, 4, 27), sum=175002),
 Row(time=datetime.datetime(2018, 7, 23, 14, 4, 28), sum=174892)...]
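
To pin down what is being asked, here is a minimal plain-Python sketch of the intended computation, not a Spark solution. It assumes the window is the `time` column truncated to the second, and that an id with no row in a given second contributes nothing to that second's sum. The function name `sum_last_per_second` and the last sample row are made up for illustration.

```python
from datetime import datetime

# Rows shaped like the question's data: (id, value, time).
# The first four are taken from the sample; the last one is made up
# to show a second window.
rows = [
    ("487", 35185, datetime(2018, 7, 23, 14, 4, 22, 6095)),
    ("489", 35285, datetime(2018, 7, 23, 14, 4, 22, 15215)),
    ("487", 35184, datetime(2018, 7, 23, 14, 4, 22, 906083)),
    ("489", 35285, datetime(2018, 7, 23, 14, 4, 22, 908243)),
    ("487", 35180, datetime(2018, 7, 23, 14, 4, 23, 6100)),  # hypothetical
]


def sum_last_per_second(rows):
    """Sum, for each one-second window, the latest value of every id."""
    latest = {}  # (window start, id) -> (time, value) of the newest row seen
    for id_, value, time in rows:
        window = time.replace(microsecond=0)  # truncate to the second
        key = (window, id_)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)

    sums = {}
    for (window, _id), (_time, value) in latest.items():
        sums[window] = sums.get(window, 0) + value
    return dict(sorted(sums.items()))


for window, total in sum_last_per_second(rows).items():
    print(window, total)
```

Whatever the Spark version ends up being, it should agree with this on a small sample; that makes it a handy check before running on the full data.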

I have already tried this:

from pyspark.sql import Window, functions as F

w = Window.partitionBy(F.col('id')).orderBy('timestamp')
_df.withColumn('last_known', F.last('value').over(w)).sort('time').take(1000)

It produces a new column holding the last known value for each id, but I don't know how to sum it.

[Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 6095), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 15215), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 25456), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 35641), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 44516), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 106098), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 108248), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 118453), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 129638), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 138515), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 206095), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 215213), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 225445), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 234635), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 244514), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35185, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 306095), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 309226), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 319454), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 329651), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 337523), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 406077), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 415209), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 425481), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 435638), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 445548), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 506073), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 508245), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 519452), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 529641), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 537512), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 606087), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 615193), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 625452), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35276, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 635632), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 645538), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 706073), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 709212), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 718452), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35275, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 729642), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 738524), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 806095), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 815210), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 825455), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35275, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 834640), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 844520), timestamp=1532354662, last_known=35187),
 Row(id='487', value=35184, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 906083), timestamp=1532354662, last_known=35184),
 Row(id='489', value=35285, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 908243), timestamp=1532354662, last_known=35285),
 Row(id='48B', value=35211, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 918445), timestamp=1532354662, last_known=35211),
 Row(id='48D', value=35275, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 929632), timestamp=1532354662, last_known=35275),
 Row(id='48F', value=35187, time=datetime.datetime(2018, 7, 23, 14, 4, 22, 938511), timestamp=1532354662, last_known=35187)]

Another solution is:

_df.orderBy('time').groupBy('timestamp', 'id').agg(F.last('value').alias('last'))\
.groupBy('timestamp').agg(F.sum('last').alias('sum'))\
.sort('timestamp').take(50)

The output looks promising, but the double aggregation seems cumbersome, and this has to run on terabytes of data, so speed is a concern too.

[Row(timestamp=1532354662, sum=176142),
 Row(timestamp=1532354663, sum=176142),
 Row(timestamp=1532354664, sum=176139),
 Row(timestamp=1532354665, sum=176137),
 Row(timestamp=1532354666, sum=176133),
 Row(timestamp=1532354667, sum=176128),
 Row(timestamp=1532354668, sum=176125),
 Row(timestamp=1532354669, sum=176122),
 Row(timestamp=1532354670, sum=176120),
 Row(timestamp=1532354671, sum=176118),
 Row(timestamp=1532354672, sum=176117),
 Row(timestamp=1532354673, sum=176114),

Any help would be greatly appreciated!

Edit: Terry's answer is the best. If anyone has a better idea, please post it.

1 answer:

Answer 0 (score: 0)

I believe you can replace the first groupBy with dropDuplicates and set ascending to False in the orderBy.

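
A plain-Python sketch of the idea described in this answer (my own illustration, not the answerer's code): sort newest first, keep only the first row seen for each `(timestamp, id)` pair, then sum per `timestamp`. In PySpark this would roughly map to `orderBy('time', ascending=False)`, then `dropDuplicates(['timestamp', 'id'])`, then a single `groupBy('timestamp')` with `F.sum('value')`. One assumption is worth checking before relying on it: as far as I know, Spark does not document which of the duplicate rows `dropDuplicates` keeps, so the "first row after the sort survives" behaviour shown here may not hold on a distributed DataFrame. The function name `sum_after_dedup` and the last sample row are hypothetical.

```python
from datetime import datetime

# Rows with the question's columns. The first three come from the sample
# data; the last one is made up to show a second timestamp.
rows = [
    {"id": "48D", "value": 35276,
     "time": datetime(2018, 7, 23, 14, 4, 22, 635632), "timestamp": 1532354662},
    {"id": "48D", "value": 35275,
     "time": datetime(2018, 7, 23, 14, 4, 22, 929632), "timestamp": 1532354662},
    {"id": "48F", "value": 35187,
     "time": datetime(2018, 7, 23, 14, 4, 22, 938511), "timestamp": 1532354662},
    {"id": "48D", "value": 35270,  # hypothetical row
     "time": datetime(2018, 7, 23, 14, 4, 23, 29640), "timestamp": 1532354663},
]


def sum_after_dedup(rows):
    """Sort newest first, drop later duplicates of (timestamp, id), sum."""
    # Step 1: order by time, descending (the "ascending=False" part).
    newest_first = sorted(rows, key=lambda r: r["time"], reverse=True)

    # Step 2: keep the first row per (timestamp, id), i.e. the newest one,
    # and add its value to that timestamp's running total.
    seen = set()
    sums = {}
    for row in newest_first:
        key = (row["timestamp"], row["id"])
        if key in seen:
            continue  # an older row for the same id and second
        seen.add(key)
        sums[row["timestamp"]] = sums.get(row["timestamp"], 0) + row["value"]
    return dict(sorted(sums.items()))


print(sum_after_dedup(rows))
```

If the ordering guarantee turns out not to hold in Spark, a deterministic alternative is to rank rows per `(timestamp, id)` by `time` descending and keep rank 1; that is a different technique from the one the answer proposes.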