I am trying to compare two DataFrames in PySpark, but I get different results even though the two DataFrames are built from the same query.
# Both queries below are identical; only the variable names differ.
df1Original = spark.sql("""
    SELECT
        BOUNCES.SUBSCRIBERID,
        BOUNCES.SENDID,
        COUNT(CASE WHEN BOUNCES.EVENTTYPE = 'Bounce' THEN 1 ELSE 0 END) AS NUM_ACC_REBOTADOS
    FROM mytable BOUNCES
    GROUP BY BOUNCES.SUBSCRIBERID, BOUNCES.SENDID
""")
df1Modified = spark.sql("""
    SELECT
        BOUNCES.SUBSCRIBERID,
        BOUNCES.SENDID,
        COUNT(CASE WHEN BOUNCES.EVENTTYPE = 'Bounce' THEN 1 ELSE 0 END) AS NUM_ACC_REBOTADOS
    FROM mytable BOUNCES
    GROUP BY BOUNCES.SUBSCRIBERID, BOUNCES.SENDID
""")
print(df1Original.subtract(df1Modified).count())
print(df1Original.subtract(df1Modified).count())
The two identical print statements give different results. Why? I would expect 0 in both cases.
0
100
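A side note on the query itself, separate from the differing counts and not offered as their explanation: in standard SQL, COUNT(expr) counts every row where expr is non-NULL, so with ELSE 0 the expression above counts all rows in each group, not only the bounces. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in for Spark SQL. The three sample rows and the column types are made up for illustration, since the real schema and contents of mytable are not shown here.

```python
import sqlite3

# In-memory stand-in for mytable; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (SUBSCRIBERID INTEGER, SENDID INTEGER, EVENTTYPE TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, "Bounce"), (1, 10, "Open"), (1, 10, "Click")],
)

query = """
SELECT
    -- ELSE 0 is non-NULL, so every row in the group is counted
    COUNT(CASE WHEN EVENTTYPE = 'Bounce' THEN 1 ELSE 0 END) AS count_else_0,
    -- without ELSE, non-matching rows yield NULL and are skipped
    COUNT(CASE WHEN EVENTTYPE = 'Bounce' THEN 1 END) AS count_no_else,
    -- SUM adds up the 1s and 0s, so only bounces contribute
    SUM(CASE WHEN EVENTTYPE = 'Bounce' THEN 1 ELSE 0 END) AS sum_else_0
FROM mytable
GROUP BY SUBSCRIBERID, SENDID
"""

count_else_0, count_no_else, sum_else_0 = conn.execute(query).fetchone()
print(count_else_0, count_no_else, sum_else_0)  # prints: 3 1 1
```

If NUM_ACC_REBOTADOS is meant to hold the number of bounces per group, dropping the ELSE 0 or switching to SUM would probably be closer to the intent.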