I'm seeing strange behavior when I use JavaRDD subtract to compare 2 DataFrames.
This is what I'm doing: I check whether 2 DataFrames (A, B) are equal by converting them to JavaRDDs and then subtracting A from B and B from A. If they are equal (contain the same data), then both results should be an empty JavaRDD.
However, the results are not empty:
DataFrame A = someFunctionRespondWithDF(param);
DataFrame B = sqlContext.read().json("src/test/resources/expected/exp.json");
Assert.assertTrue(B.toJavaRDD().subtract(A.toJavaRDD()).isEmpty());
Assert.assertTrue(A.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());
...the assertions fail
If I write the data to disk and read it back into another DataFrame, then it works:
A.write().json("target/result.json");
DataFrame AA = sqlContext.read().json("target/result.json");
Assert.assertTrue(B.toJavaRDD().subtract(AA.toJavaRDD()).isEmpty());
Assert.assertTrue(AA.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());
...the assertions pass
I also tried to force evaluation by calling count(), cache() or persist() on the DataFrame (based on this answer), but without success:
DataFrame AAA = A.cache();
Assert.assertTrue(B.toJavaRDD().subtract(AAA.toJavaRDD()).isEmpty());
Assert.assertTrue(AAA.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());
Has anybody experienced the same? What am I missing here?
Spark version: 1.6.1
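Answer 0 (score: 1)
OK, I can answer my own question:
The assertions fail because the column types differ when I read the DataFrame from JSON. Say I have an Integer in the original DataFrame: after reading it back from JSON (without a schema file!) it becomes a Long. Solution: use a format that describes the schema, such as Avro.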
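To make the type mismatch easier to spot, here is a minimal plain-Java sketch of a schema comparison. It does not use Spark: the maps stand in for the two DataFrame schemas (column name to type name), and SchemaDiff, typeMismatches and the column names are hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch; the maps stand in for DataFrame schemas (column name -> type name).
// SchemaDiff and typeMismatches are hypothetical names, not part of Spark.
class SchemaDiff {

    // Returns the columns present in both schemas whose declared types differ,
    // each mapped to a "left vs right" description.
    static Map<String, String> typeMismatches(Map<String, String> left,
                                              Map<String, String> right) {
        Map<String, String> diff = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            String other = right.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                diff.put(e.getKey(), e.getValue() + " vs " + other);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // Assumed schema of the computed DataFrame (A).
        Map<String, String> computed = new LinkedHashMap<>();
        computed.put("id", "integer");
        computed.put("name", "string");

        // Assumed schema of the DataFrame read from JSON without a schema (B):
        // the integer column comes back as long, as described in the answer.
        Map<String, String> fromJson = new LinkedHashMap<>();
        fromJson.put("id", "long");
        fromJson.put("name", "string");

        System.out.println(typeMismatches(computed, fromJson));
    }
}
```

In Spark 1.6 the real schemas can be inspected with A.printSchema() and B.printSchema(), or compared through A.schema() and B.schema(). Besides switching to a self-describing format, another option that may help is to pass the schema explicitly when reading the expected file, for example sqlContext.read().schema(A.schema()).json(...), so that no types are inferred.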