I tried this simple example:
scala> rdd2.collect
res45: Array[Person] = Array(Person(Mary,28,New York), Person(Bill,17,Philadelphia), Person(Craig,34,Philadelphia), Person(Leah,26,Rochester))
scala> rdd3.collect
res44: Array[Person] = Array(Person(Mary,28,New York), Person(Bill,17,Philadelphia), Person(Craig,35,Philadelphia), Person(Leah,26,Rochester))
scala> rdd2.subtract(rdd3).collect
res46: Array[Person] = Array(Person(Mary,28,New York), Person(Leah,26,Rochester), Person(Bill,17,Philadelphia), Person(Craig,34,Philadelphia))
I expected rdd2.subtract(rdd3).collect to return only Person(Craig,34,Philadelphia), but I get all of rdd2 back as the output. Can anyone explain why?
Answer 0 (score: 0)
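This is a known issue with the Scala REPL: equality comparisons do not work correctly for a case class defined in the REPL, so subtract never finds matching elements. The problem only occurs in the REPL and goes away when you run the application via spark-submit. It is described in detail in this ticket. To work around it in the REPL, define the case class in its own package using :paste -raw, as follows.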
scala> :paste -raw // make sure you are using Scala 2.11 for the raw option to work.
// Entering paste mode (ctrl-D to finish)
package mytest;
case class Person(name: String, age: Int, city: String);
// Exiting paste mode, now interpreting.
scala> import mytest.Person
scala> val rdd2 = sc.parallelize(Seq(Person("Mary",28,"New York"), Person("Bill",17,"Philadelphia"), Person("Craig",34,"Philadelphia"), Person("Leah",26,"Rochester")))
rdd2: org.apache.spark.rdd.RDD[mytest.Person] = ParallelCollectionRDD[6] at parallelize at <console>:25
scala> val rdd3 = sc.parallelize(Seq(Person("Mary",28,"New York"), Person("Bill",17,"Philadelphia"), Person("Craig",35,"Philadelphia"), Person("Leah",26,"Rochester")))
rdd3: org.apache.spark.rdd.RDD[mytest.Person] = ParallelCollectionRDD[7] at parallelize at <console>:25
scala> rdd2.subtract(rdd3).collect
res1: Array[mytest.Person] = Array(Person(Craig,34,Philadelphia))
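To see what subtract depends on, here is a minimal sketch that needs no Spark at all. It assumes, as far as I understand the behaviour, that subtract matches elements through their equals and hashCode, and it uses Seq.diff on local collections as a stand-in for that comparison. The object name EqualityCheck and the values left, right and remaining are mine, chosen for illustration only.

```scala
// Compiled as a normal source file (not typed into the REPL), a case class
// gets structural equals and hashCode generated from its fields.
case class Person(name: String, age: Int, city: String)

object EqualityCheck {
  // Same data as rdd2 and rdd3 in the question, held in local collections.
  val left: Seq[Person] = Seq(
    Person("Mary", 28, "New York"),
    Person("Bill", 17, "Philadelphia"),
    Person("Craig", 34, "Philadelphia"),
    Person("Leah", 26, "Rochester"))

  val right: Seq[Person] = Seq(
    Person("Mary", 28, "New York"),
    Person("Bill", 17, "Philadelphia"),
    Person("Craig", 35, "Philadelphia"),
    Person("Leah", 26, "Rochester"))

  // Seq.diff drops every element of `left` that equals an element of `right`.
  val remaining: Seq[Person] = left.diff(right)

  def main(args: Array[String]): Unit = {
    // Two separately constructed but identical instances compare equal.
    println(Person("Mary", 28, "New York") == Person("Mary", 28, "New York"))
    // Only the record whose age differs is left over.
    println(remaining)
  }
}
```

If equality behaves as it should, only Person(Craig,34,Philadelphia) is left over, which matches what subtract returns once the case class lives in its own package. When every element survives the subtraction, as in the question, that points to the equality check rather than to subtract itself.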