How to implement window functions (lag, lead) in pyspark?

Date: 2019-02-21 06:54:42

Tags: tsql pyspark window pyspark-sql lead

Below is the T-SQL code. I tried to convert it to pyspark using window functions (also shown below).

case
    when eventaction = 'IN'
         and lead(eventaction, 1) over (partition by barcode order by barcode, eventdate, transactionid) in ('IN', 'OUT')
        then lead(eventaction, 1) over (partition by barcode order by barcode, eventdate, transactionid)
    else ''
end as next_action

The pyspark code using the window function lead gives an error:

Tgt_df = Tgt_df.withColumn((('Lead', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'IN' )|
                    ('1', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'OUT')
                     , (lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate"))).otherwise('').alias("next_action")))

But it does not work. What should I do?

1 answer:

Answer 0 (score: 0)

The withColumn method must be called as df.withColumn('name_of_col', value_of_column) — a column name first, then a single Column expression — which is why you get an error: you are passing a tuple of conditions instead.

For your T-SQL query, the corresponding pyspark code would be:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Match the T-SQL: partition by barcode, order by eventdate then transactionid.
# Note the column is 'eventaction' (as in your data), not 'event_action'.
w = Window.partitionBy("barcode").orderBy("eventdate", "transactionid")

Tgt_df = Tgt_df.withColumn(
    'next_action',
    F.when((F.col('eventaction') == 'IN')
           & (F.lead('eventaction', 1).over(w).isin(['IN', 'OUT'])),
           F.lead('eventaction', 1).over(w)
           ).otherwise('')
)