Filtering a Spark DataFrame on multiple columns with multiple conditions in PySpark

Time: 2021-01-07 04:29:54

Tags: python dataframe filter pyspark apache-spark-sql

I want to implement the following SQL condition in PySpark:

SELECT *
FROM   table
WHERE  NOT (ID = 1 AND Event = 1)
  AND  NOT (ID = 2 AND Event = 2)
  AND  NOT (ID = 1 AND Event = 0)
  AND  NOT (ID = 2 AND Event = 0)

What is a clean way to do this?

2 answers:

Answer 0 (score: 2)

For the DataFrame API version, use the filter function or its alias where.

The equivalent code is as follows:

df.filter(~((df.ID == 1) & (df.Event == 1)) & 
          ~((df.ID == 2) & (df.Event == 2)) & 
          ~((df.ID == 1) & (df.Event == 0)) &
          ~((df.ID == 2) & (df.Event == 0)))
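
If the list of excluded (ID, Event) combinations grows, the filter above can also be built from data instead of being written out by hand. The following is only a minimal sketch: the helper name exclusion_condition and the EXCLUDED_PAIRS list are hypothetical, and df is assumed to be an existing PySpark DataFrame with ID and Event columns. It relies only on the same ==, & and ~ column operators used above.

```python
from functools import reduce
from operator import and_

# Hypothetical list of (ID, Event) pairs to exclude, copied from the question's SQL.
EXCLUDED_PAIRS = [(1, 1), (2, 2), (1, 0), (2, 0)]


def exclusion_condition(df, pairs, id_col="ID", event_col="Event"):
    """Build a condition that holds for rows matching none of the given pairs."""
    # One negated (ID == i AND Event == e) term per excluded pair.
    negated = [~((df[id_col] == i) & (df[event_col] == e)) for i, e in pairs]
    # Combine all negated terms with AND.
    return reduce(and_, negated)


# Usage, assuming `df` is an existing PySpark DataFrame:
# result = df.filter(exclusion_condition(df, EXCLUDED_PAIRS))
```

The result should be the same expression as the hand-written filter, just assembled in a loop, so adding another excluded pair only means adding one tuple to the list.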

Answer 1 (score: 1)

If you are lazy, you can copy and paste the SQL filter expression straight into the PySpark filter:

df.filter("""
    NOT (ID = 1 AND Event = 1)
    AND NOT (ID = 2 AND Event = 2)
    AND NOT (ID = 1 AND Event = 0)
    AND NOT (ID = 2 AND Event = 0)
""")
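
The same SQL string can also be generated rather than pasted. This is a rough sketch under the assumption that the excluded values are integers. The function name build_exclusion_filter and the excluded_pairs list are made up for illustration, and df is assumed to be an existing DataFrame.

```python
# Hypothetical list of (ID, Event) pairs to exclude, copied from the question's SQL.
excluded_pairs = [(1, 1), (2, 2), (1, 0), (2, 0)]


def build_exclusion_filter(pairs, id_col="ID", event_col="Event"):
    """Return a SQL filter string that rejects every (ID, Event) pair in `pairs`."""
    # int() ensures only integer literals are interpolated into the SQL text.
    clauses = [
        f"NOT ({id_col} = {int(i)} AND {event_col} = {int(e)})"
        for i, e in pairs
    ]
    return " AND ".join(clauses)


filter_expr = build_exclusion_filter(excluded_pairs)

# Usage, assuming `df` is an existing PySpark DataFrame:
# result = df.filter(filter_expr)
```

The generated string is passed to filter in the same way as the pasted expression above.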