How to compare two Spark DataFrames?

Time: 2019-10-01 08:41:33

Tags: apache-spark pyspark pyspark-dataframes

I am trying to compare two DataFrames with PySpark, but I get different results for what should be the same DataFrame.

df1Original = spark.sql("SELECT \
                                    BOUNCES.SUBSCRIBERID, \
                                    BOUNCES.SENDID, \
                                    COUNT(CASE WHEN BOUNCES.EVENTTYPE = 'Bounce' THEN 1 ELSE 0 END) AS NUM_ACC_REBOTADOS \
                            FROM mytable BOUNCES \
                            GROUP BY BOUNCES.SUBSCRIBERID, BOUNCES.SENDID")

df1Modified = spark.sql("SELECT \
                                BOUNCES.SUBSCRIBERID, \
                                BOUNCES.SENDID, \
                                COUNT(CASE WHEN BOUNCES.EVENTTYPE = 'Bounce' THEN 1 ELSE 0 END) AS NUM_ACC_REBOTADOS \
                            FROM mytable  BOUNCES \
                            GROUP BY BOUNCES.SUBSCRIBERID, BOUNCES.SENDID")

print(df1Original.subtract(df1Modified).count())
print(df1Original.subtract(df1Modified).count())

Why do I get different results? It should be 0 in both cases.

Output of the two print calls (from the notebook's Spark job output):

    0
    100
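For reference, the equality check being attempted can be sketched outside Spark by collecting the rows and comparing them as multisets; the row values below are hypothetical stand-ins for the `(SUBSCRIBERID, SENDID, NUM_ACC_REBOTADOS)` tuples. Note that a one-directional `df1.subtract(df2).count() == 0` alone does not prove two DataFrames are equal, since `subtract` behaves like SQL `EXCEPT DISTINCT`; a symmetric check over both directions is closer to true equality.

```python
# Minimal sketch of a symmetric, order-independent comparison, using
# plain Python tuples in place of Spark rows (hypothetical data).
# With real DataFrames the analogous symmetric check would be roughly:
#   df1.subtract(df2).count() == 0 and df2.subtract(df1).count() == 0
from collections import Counter

def frames_equal(rows_a, rows_b):
    """Compare two collections of rows as multisets; row order is ignored."""
    return Counter(rows_a) == Counter(rows_b)

rows1 = [("sub1", "send1", 3), ("sub2", "send2", 5)]
rows2 = [("sub2", "send2", 5), ("sub1", "send1", 3)]
print(frames_equal(rows1, rows2))           # True: same rows, different order
print(frames_equal(rows1, rows1 + rows1))   # False: duplicate counts differ
```

Unlike `subtract`, the `Counter` comparison also detects differing duplicate counts, which `EXCEPT DISTINCT` semantics would silently ignore.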

0 Answers:

There are no answers yet.