Spark DataFrame count does not return the same result when run twice

Date: 2016-04-05 23:35:10

Tags: scala apache-spark apache-spark-sql yarn spark-dataframe

Using Spark 1.5.1 and Hive 1.2.1.

When I run this snippet under spark-shell --master yarn --deploy-mode client:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

var queryLeft = "SELECT t1.* FROM (SELECT t2.*, row_number() over (PARTITION BY CAST(TRIM(t2.pk) as DECIMAL(31,8)) ORDER BY t2.create_dt DESC) AS R FROM myschema.mytable t2 WHERE t2.part_dt='mydate' AND t2.part_seq='myseq') t1 WHERE t1.R = 1"

val dfLeft = hiveContext.sql(queryLeft)
val firstCount = dfLeft.count
val secondCount = dfLeft.count

I get this result, and both counts are wrong (and not equal!!):

scala> print (firstCount, secondCount)
(1865,2373)

When I run the same snippet under a plain spark-shell (local mode), I get the correct result:

scala> print (firstCount, secondCount)
(2395,2395)

Am I doing something wrong?
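(For anyone hitting this later: a sketch of two things worth trying, not verified against this cluster. Each call to count re-executes the whole query, and row_number() ordered only by t2.create_dt is non-deterministic whenever two rows share the same create_dt, so repeated runs can legitimately keep different rows. The ORDER BY tiebreaker and the cache() call below are assumptions on my part, not something from the original post.)

```scala
// Option 1: make the window ordering total, so row_number() picks the
// same row on every execution -- add a tiebreaker column to ORDER BY:
//   ... ORDER BY t2.create_dt DESC, t2.pk ...

// Option 2: materialize the result once, so both counts read the same data.
val dfLeft = hiveContext.sql(queryLeft).cache()
val firstCount = dfLeft.count   // runs the query and caches the rows
val secondCount = dfLeft.count  // served from cache; equals firstCount
```

Caching only guarantees the two counts agree with each other; if the window ordering has ties, the cached result may still differ between separate runs of the job, which is why fixing the ORDER BY is the more fundamental repair.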

0 Answers:

There are no answers yet