I would like to know whether PySpark can be used to compute time differences per group in a dataset. For example, I have
What I want is
CODE1 | CODE2 | TIME
00001 | AAA | 2019-01-01 14:00:00
00001 | AAA | 2019-01-01 14:05:00
00001 | AAA | 2019-01-01 14:10:00
00001 | BBB | 2019-01-01 14:15:00
00001 | BBB | 2019-01-01 14:20:00
00001 | AAA | 2019-01-01 14:25:00
00001 | AAA | 2019-01-01 14:30:00
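The time difference should run from the first record to the last record of the same category. I have already sorted the data by time. Is this possible?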
Answer 0 (score: 0)
I coded this in a plain, straightforward way. It could be optimized further by making more use of Spark's built-in functions.
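The key step in the session that follows is to flag every row whose CODE2 differs from the previous row and then take a running sum of those flags, so that each consecutive run of the same code gets its own number. As a rough illustration only, here is that idea in plain Python rather than Spark; `label_runs` is a hypothetical helper name introduced for this sketch.

```python
from itertools import accumulate


def label_runs(values):
    """Number consecutive runs of equal values, starting at 1."""
    # flag is 1 where the value differs from its predecessor;
    # the first row has no predecessor, so it always counts as a change
    flags = [1 if i == 0 or v != values[i - 1] else 0
             for i, v in enumerate(values)]
    # the running sum of the flags is the run number
    return list(accumulate(flags))


codes = ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA"]
batch_order = label_runs(codes)
print(batch_order)  # [1, 1, 1, 2, 2, 3, 3]
```

Because the second AAA run gets number 3 rather than 1, it is kept separate from the first AAA run when the rows are grouped later.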
>>> df.show()
+-----+-----+-------------------+
|CODE1|CODE2| TIME|
+-----+-----+-------------------+
| 1| AAA|2019-01-01 14:00:00|
| 1| AAA|2019-01-01 14:05:00|
| 1| AAA|2019-01-01 14:10:00|
| 1| BBB|2019-01-01 14:15:00|
| 1| BBB|2019-01-01 14:20:00|
| 1| AAA|2019-01-01 14:25:00|
| 1| AAA|2019-01-01 14:30:00|
+-----+-----+-------------------+
>>> df.printSchema()
root
|-- CODE1: long (nullable = true)
|-- CODE2: string (nullable = true)
|-- TIME: string (nullable = true)
>>> from pyspark.sql import functions as F, Window
>>> win = Window.partitionBy(F.lit(0)).orderBy('TIME')
# batch_order numbers each consecutive run of identical CODE2 values, in TIME order
>>> df_1=df.withColumn('prev_batch', F.lag('CODE2').over(win)) \
... .withColumn('flag', F.when(F.col('CODE2') == F.col('prev_batch'),0).otherwise(1)) \
... .withColumn('batch_order', F.sum('flag').over(win)) \
... .drop('prev_batch', 'flag') \
... .sort('TIME')
>>> df_1.show()
+-----+-----+-------------------+-----------+
|CODE1|CODE2| TIME|batch_order|
+-----+-----+-------------------+-----------+
| 1| AAA|2019-01-01 14:00:00| 1|
| 1| AAA|2019-01-01 14:05:00| 1|
| 1| AAA|2019-01-01 14:10:00| 1|
| 1| BBB|2019-01-01 14:15:00| 2|
| 1| BBB|2019-01-01 14:20:00| 2|
| 1| AAA|2019-01-01 14:25:00| 3|
| 1| AAA|2019-01-01 14:30:00| 3|
+-----+-----+-------------------+-----------+
# Extract the min and max timestamp of each (batch_order, CODE2) group
>>> df_max=df_1.groupBy([df_1.batch_order,df_1.CODE2]).agg(F.max("TIME").alias("mx"))
>>> df_min=df_1.groupBy([df_1.batch_order,df_1.CODE2]).agg(F.min("TIME").alias("mn"))
>>> df_max.show()
+-----------+-----+-------------------+
|batch_order|CODE2| mx|
+-----------+-----+-------------------+
| 1| AAA|2019-01-01 14:10:00|
| 2| BBB|2019-01-01 14:20:00|
| 3| AAA|2019-01-01 14:30:00|
+-----------+-----+-------------------+
>>> df_min.show()
+-----------+-----+-------------------+
|batch_order|CODE2| mn|
+-----------+-----+-------------------+
| 1| AAA|2019-01-01 14:00:00|
| 2| BBB|2019-01-01 14:15:00|
| 3| AAA|2019-01-01 14:25:00|
+-----------+-----+-------------------+
# Join on the column names so batch_order and CODE2 appear only once in the result
>>> df_joined=df_max.join(df_min,['batch_order','CODE2'])
>>> df_joined.show()
+-----------+-----+-------------------+-------------------+
|batch_order|CODE2| mx| mn|
+-----------+-----+-------------------+-------------------+
| 1| AAA|2019-01-01 14:10:00|2019-01-01 14:00:00|
| 3| AAA|2019-01-01 14:30:00|2019-01-01 14:25:00|
| 2| BBB|2019-01-01 14:20:00|2019-01-01 14:15:00|
+-----------+-----+-------------------+-------------------+
>>> from pyspark.sql.functions import unix_timestamp
>>> from pyspark.sql.types import IntegerType
# Difference between the max and min timestamp of each group, in whole minutes
>>> df_joined.withColumn("diff",((unix_timestamp(df_joined.mx, 'yyyy-MM-dd HH:mm:ss')-unix_timestamp(df_joined.mn, 'yyyy-MM-dd HH:mm:ss'))/60).cast(IntegerType())).show()
+-----------+-----+-------------------+-------------------+----+
|batch_order|CODE2| mx| mn|diff|
+-----------+-----+-------------------+-------------------+----+
| 1| AAA|2019-01-01 14:10:00|2019-01-01 14:00:00| 10|
| 3| AAA|2019-01-01 14:30:00|2019-01-01 14:25:00| 5|
| 2| BBB|2019-01-01 14:20:00|2019-01-01 14:15:00| 5|
+-----------+-----+-------------------+-------------------+----+
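As a sanity check on the numbers above, the same per-run differences can be reproduced outside Spark. The sketch below is plain Python, not PySpark. It uses `itertools.groupby`, which only groups consecutive equal keys, so it assumes the rows are already sorted by TIME as the question states. `run_durations` is a hypothetical helper name, and the rows are the sample data from the question with CODE1 omitted.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"

# (CODE2, TIME) pairs from the question, already sorted by TIME
rows = [
    ("AAA", "2019-01-01 14:00:00"),
    ("AAA", "2019-01-01 14:05:00"),
    ("AAA", "2019-01-01 14:10:00"),
    ("BBB", "2019-01-01 14:15:00"),
    ("BBB", "2019-01-01 14:20:00"),
    ("AAA", "2019-01-01 14:25:00"),
    ("AAA", "2019-01-01 14:30:00"),
]


def run_durations(rows):
    """Return (code, minutes from first to last row) for each consecutive run."""
    result = []
    for code, run in groupby(rows, key=lambda r: r[0]):
        times = [datetime.strptime(t, FMT) for _, t in run]
        # rows are sorted, so the first and last entries are the min and max
        minutes = int((times[-1] - times[0]).total_seconds() // 60)
        result.append((code, minutes))
    return result


durations = run_durations(rows)
print(durations)  # [('AAA', 10), ('BBB', 5), ('AAA', 5)]
```

The three values match the `diff` column in the Spark output, listed here in time order rather than in the order the join happened to return them.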