Spark: filter rows inside a window function

Date: 2019-05-09 17:13:11

Tags: pyspark

I need to apply a window function in PySpark, but certain rows must be ignored while computing it.

I tried the following code:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = sc.parallelize([
    {"id": "900",  "service": "MM", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-09-13 13:38:17.229"},
    {"id": "900",  "service": "MM", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-09-13 13:38:17.242"},
    {"id": "1527", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 14:52:02.331"},
    {"id": "1527", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 14:52:02.490"},
    {"id": "1527", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 14:52:02.647"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.095"},
    {"id": "1504", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.253"},
    {"id": "1504", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.372"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-10-17 22:28:25.732"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:53.445"},
    {"id": "1504", "service": "MT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:53.643"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:53.924"},
    {"id": "1504", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:54.094"},
    {"id": "1504", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:54.243"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-09 02:05:54.732"},
    {"id": "1504", "service": "RA", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:30.764"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:31.099"},
    {"id": "1504", "service": "RT", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:33.363"},
    {"id": "1504", "service": "RV", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:33.677"},
    {"id": "1504", "service": "RP", "guid": "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "time": "2018-11-11 20:52:39.572"}
]).toDF()
(
    df
    .withColumn(
        'rank',
        F.when(
            F.col('id') != 900,
            F.row_number().over(
                # I also tried wrapping the partition key in F.when(...), without success
                Window.partitionBy(F.col('guid'))
                      .orderBy(F.col('time').asc())
            )
        )
    )
    .select('id', 'service', 'guid', 'time', 'rank')
    .show(truncate=False)
)

(output screenshot: the rank column is null for the two id = 900 rows and then starts at 3)

I almost have it, but the row numbers must start at 1 instead of 3: in the rank column, the first value after the two nulls should be 1, not 3.
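As an aside, building the DataFrame by parallelizing a list of dicts and calling toDF() is deprecated in PySpark, and schema inference from dicts orders the columns alphabetically. A minimal equivalent sketch with an explicit column list, assuming a SparkSession named spark, would be:

rows = [
    ("900", "MM", "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "2018-09-13 13:38:17.229"),
    ("900", "MM", "43158A8E-3DF2-4FD2-90C9-B73411BBE683", "2018-09-13 13:38:17.242"),
    # ... remaining records exactly as in the snippet above ...
]
# an explicit column list avoids relying on dict-key ordering
df = spark.createDataFrame(rows, ["id", "service", "guid", "time"])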

1 Answer:

Answer 0 (score: 0)

IIUC, you just need to add a temporary partition column whose value is something like id == 900 ? 0 : 1. Your original window partitions only by guid, so row_number() still counts the two id == 900 rows; F.when() merely masks their rank after the fact. Moving those rows into their own partition makes the numbering of the remaining rows restart at 1.

from pyspark.sql import Window, functions as F

# include `part` in partitionBy (splits the window on whether id is 900)
win = Window.partitionBy('guid','part').orderBy('time')

# define part and then calculate rank
df = df.withColumn('part', F.when(df.id == 900, 0).otherwise(1)) \
       .withColumn('rank', F.when(F.col('part')==1, F.row_number().over(win))) \
       .drop('part')
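With the sample data from the question, the numbering should now restart at 1 after the two masked rows; a quick check, reusing the df produced above:

# the two id == 900 rows keep a null rank; the first id == 1527 row
# should now receive rank 1 instead of 3
df.orderBy('time').show(truncate=False)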