I ran into this behavior today. When I run the following query in the Hive CLI, I get a different result than when I run the same query through pyspark:
Hive:
Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX=t2.fieldX AND t1.fieldY=t2.fieldY);
Result: 17488
SparkSQL:
hc.sql("Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX==t2.fieldX AND t1.fieldY==t2.fieldY)")
Result: 5555
I get the same result with the DataFrame API:
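from pyspark.sql.functions import col

# Refer to the aliased columns by name; t1 and t2 are aliases, not Python variables
table1.alias("t1").join(
    other=table2.alias("t2"),
    on=[col("t1.fieldX") == col("t2.fieldX"), col("t1.fieldY") == col("t2.fieldY")],
    how='inner'
).select("t1.fieldX").distinct().count()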
Result: 5555
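For reference, all three snippets are meant to compute the same quantity: the number of distinct fieldX values in table1 whose (fieldX, fieldY) pair also appears in table2. Below is a minimal pure-Python sketch of that computation on made-up rows. The helper name `count_distinct_joined` and the sample data are hypothetical and are not taken from the real tables; it only pins down the expected semantics and does not explain why the two engines disagree.

```python
def count_distinct_joined(left_rows, right_rows):
    """Reference for: SELECT count(DISTINCT t1.fieldX)
    FROM t1 JOIN t2 ON t1.fieldX = t2.fieldX AND t1.fieldY = t2.fieldY.

    Rows are (fieldX, fieldY) tuples; None stands in for SQL NULL.
    """
    # SQL equality never matches NULL, so keys containing None cannot join.
    right_keys = {
        (x, y) for x, y in right_rows if x is not None and y is not None
    }
    # Keep the fieldX of every left row whose full key exists on the right.
    matched_x = {x for x, y in left_rows if (x, y) in right_keys}
    return len(matched_x)


# Hypothetical sample rows, for illustration only.
sample_table1 = [("a", 1), ("a", 2), ("b", 1), ("c", 3), (None, 1)]
sample_table2 = [("a", 1), ("b", 1), ("b", 1), ("c", 9), (None, 1)]

print(count_distinct_joined(sample_table1, sample_table2))
```

Running the same check on a small extract of the real tables (for example, rows for a handful of fieldX values) should show which of the two counts matches this definition.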
I don't understand why I am getting different results!