I have a Spark DataFrame (PySpark 2.2.0) containing events, each with a timestamp. Another column holds a series of tags (A, B, C or Null). For each row (per event group, ordered by timestamp) I want to compute a running count of tag changes within the current stretch of non-Null tags (a Null should reset this count to 0). Example of the df, where my desired computed column is called stretch:
event timestamp tag stretch
G1 09:59:00 Null 0
G1 10:00:00 A 1 ---> first non Null tag starts the count
G1 10:01:00 A 1 ---> no change of tag
G1 10:02:00 B 2 ---> change of tag (A to B)
G1 10:03:00 A 3 ---> change of tag (B to A)
G1 10:04:00 Null 0 ---> Null resets the count
G1 10:05:00 A 1 ---> first non Null tag restarts the count
G2 10:00:00 B 1 ---> first non Null tag starts the count
G2 10:01:00 C 2 ---> change of tag (B to C)
In PySpark I can define a window like this:
window = Window.partitionBy("event").orderBy(col("timestamp").asc())
and compute, for example, the change of tag:
df = df.withColumn("change_of_tag",col("tag")!=lag("tag",1).over(window))
But I cannot find how to compute a cumulative sum of these changes that resets every time a Null tag is encountered. I suspect I should define a new window partitioned by event and by tag type (Null or not Null), but I do not know how to partition by event, order by timestamp, and then group by tag type on top of that; a minimal sketch of that idea is shown below.
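For illustration only, here is a minimal sketch of that "reset group" idea (the helper columns null_marker and reset_group are hypothetical names, not from the original post): every Null row bumps a running counter, so all rows of one stretch share a group id and a per-group count can then restart inside each (event, reset_group) partition. The answers below deal with the remaining details (the first non-Null row and unchanged tags).
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("event").orderBy(F.col("timestamp").asc())
df = df.withColumn("null_marker", F.col("tag").isNull().cast("int"))   # 1 on Null rows, 0 otherwise
df = df.withColumn("reset_group", F.sum("null_marker").over(w))        # group id that grows at each Null
# the change counter could then be taken over
# Window.partitionBy("event", "reset_group").orderBy("timestamp")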
Answer 0 (score: 1)
I think this is a very tricky case. Especially the "no change of tag" situation is hard to handle in one pass. So please find my solution below; I had to create some new calculated columns to get the result.
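For reference (not part of the original answer), a sample DataFrame like the one shown below could be built along these lines, keeping the timestamps as plain strings for simplicity:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("G1", "09:59:00", None), ("G1", "10:00:00", "A"), ("G1", "10:01:00", "A"),
     ("G1", "10:02:00", "B"), ("G1", "10:03:00", "A"), ("G1", "10:04:00", None),
     ("G1", "10:05:00", "A"), ("G2", "10:00:00", "B"), ("G2", "10:01:00", "C")],
    ["event", "timestamp", "tag"])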
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-----+---------+----+
|event|timestamp| tag|
+-----+---------+----+
| G1| 09:59:00|null|
| G1| 10:00:00| A|
| G1| 10:01:00| A|
| G1| 10:02:00| B|
| G1| 10:03:00| A|
| G1| 10:04:00|null|
| G1| 10:05:00| A|
| G2| 10:00:00| B|
| G2| 10:01:00| C|
+-----+---------+----+
>>> # new_col1: 1 when the tag is Null, 0 otherwise
>>> df = df.withColumn('new_col1', F.when(F.isnull('tag'),1).otherwise(0))
>>>
>>> window1 = Window.partitionBy('event').orderBy('timestamp')
>>>
>>> # row number, previous tag and previous row number within each event
>>> df = df.withColumn('new_col2', F.row_number().over(window1))
>>> df = df.withColumn('new_col3', F.lag('tag').over(window1))
>>> df = df.withColumn('new_col4', F.lag('new_col2').over(window1))
>>> # keep the previous row number when the tag did not change, otherwise take the current one
>>> df = df.withColumn('new_col4', F.when(df['new_col3']==df['tag'],df['new_col4']).otherwise(df['new_col2']))
>>> # running count of Nulls identifies the current stretch; Null rows themselves get None
>>> df = df.withColumn('new_col5', F.sum('new_col1').over(window1))
>>> df = df.withColumn('new_col5', F.when(F.isnull('tag'),None).otherwise(df['new_col5']))
>>>
>>> window2 = Window.partitionBy('event','new_col5').orderBy('new_col4')
>>>
>>> # dense_rank inside each (event, stretch) partition gives the count; Null rows are forced to 0
>>> df = df.withColumn('new_col6', F.when(F.isnull('tag'),0).otherwise(F.dense_rank().over(window2)))
>>> df = df.select('event','timestamp','tag', df['new_col6'].alias('stretch'))
>>> df.sort(["event", "timestamp"], ascending=[1, 1]).show()
+-----+---------+----+-------+
|event|timestamp| tag|stretch|
+-----+---------+----+-------+
| G1| 09:59:00|null| 0|
| G1| 10:00:00| A| 1|
| G1| 10:01:00| A| 1|
| G1| 10:02:00| B| 2|
| G1| 10:03:00| A| 3|
| G1| 10:04:00|null| 0|
| G1| 10:05:00| A| 1|
| G2| 10:00:00| B| 1|
| G2| 10:01:00| C| 2|
+-----+---------+----+-------+
Answer 1 (score: 0)
Modified and fixed the code:
from pyspark.sql.functions import when, col, lit, row_number, lag, sum, dense_rank
from pyspark.sql.window import Window

# sample data; "-1" stands in for Null and is converted right below
df = spark.createDataFrame([\
    ("G1", 113, "-1"),("G1", 114, "A"),("G1", 115, "A"),("G1", 116, "A"),\
    ("G1", 117, "B"),("G1", 118, "A"),("G1", 119, "-1"),\
    ("G1", 120, "A"),("G2", 121, "B"),("G2", 122, "C")],["event","timestamp","tag"])
df = df.withColumn("tag",when(col("tag")=="-1",lit(None)).otherwise(col("tag")))
window_trip = Window.partitionBy('event').orderBy('timestamp')
# in_out flags every transition between a Null and a non-Null tag
df = df.withColumn('in_out', when(\
    (row_number().over(window_trip)>1) &
    ( ( (col('tag').isNull()) & (lag('tag').over(window_trip).isNotNull())) \
    | ( (col('tag').isNotNull()) & (lag('tag').over(window_trip).isNull()) \
    ) \
    ),1) \
    .otherwise(0))
# cumulative sum of the flags gives a group id that changes at every reset
df = df.withColumn('group', sum('in_out').over(window_trip))
# tag_change marks a change of tag (or the first row); comparisons involving Null yield null, which sum() ignores
df = df.withColumn('tag_change', ((( (col('tag')!=lag('tag').over(window_trip)) ) | (row_number().over(window_trip)==1))).cast("int") )
df = df.withColumn('tag_rank', sum('tag_change').over(window_trip) )
window2 = Window.partitionBy('event','group').orderBy('tag_rank')
# dense_rank restarts the count inside each group; Null rows are forced to 0
df = df.withColumn('stretch', when(col('tag').isNull(),0).otherwise(dense_rank().over(window2)))
df.sort(["event", "timestamp"], ascending=[1, 1]).show()
Thanks again @AliYesilli, you gave me the hint and the dense_rank function!