RDD subtract on user-defined types

Date: 2017-06-05 07:30:46

Tags: apache-spark

I tried this simple example:

    scala> rdd2.collect
    res45: Array[Person] = Array(Person(Mary,28,New York), Person(Bill,17,Philadelphia), Person(Craig,34,Philadelphia), Person(Leah,26,Rochester))

    scala> rdd3.collect
    res44: Array[Person] = Array(Person(Mary,28,New York), Person(Bill,17,Philadelphia), Person(Craig,35,Philadelphia), Person(Leah,26,Rochester))

    scala> rdd2.subtract(rdd3).collect
    res46: Array[Person] = Array(Person(Mary,28,New York), Person(Leah,26,Rochester), Person(Bill,17,Philadelphia), Person(Craig,34,Philadelphia))

I expected rdd2.subtract(rdd3).collect to return only Person(Craig,34,Philadelphia), but I get all of rdd2 as the output. Can anyone explain why?

1 answer:

Answer 0 (score: 0)

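This is a known issue with the Scala REPL: equality checks on case classes defined inside the REPL do not work correctly. Try the following to work around it. The problem only occurs in the REPL and goes away when you run the application through spark-submit. The issue is described in detail in this ticket.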

    scala> :paste -raw  // make sure you are using Scala 2.11 for the raw option to work.
    // Entering paste mode (ctrl-D to finish)

    package mytest;
    case class Person(name: String, age: Int, city: String);

    // Exiting paste mode, now interpreting.

    scala> import mytest.Person

    scala> val rdd2 = sc.parallelize(Seq(Person("Mary",28,"New York"), Person("Bill",17,"Philadelphia"), Person("Craig",34,"Philadelphia"), Person("Leah",26,"Rochester")))
    rdd2: org.apache.spark.rdd.RDD[mytest.Person] = ParallelCollectionRDD[6] at parallelize at <console>:25

    scala> val rdd3 = sc.parallelize(Seq(Person("Mary",28,"New York"), Person("Bill",17,"Philadelphia"), Person("Craig",35,"Philadelphia"), Person("Leah",26,"Rochester")))
    rdd3: org.apache.spark.rdd.RDD[mytest.Person] = ParallelCollectionRDD[7] at parallelize at <console>:25

    scala> rdd2.subtract(rdd3).collect
    res1: Array[mytest.Person] = Array(Person(Craig,34,Philadelphia))
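
For the spark-submit route, the same subtraction could be packaged as a small compiled application, roughly as sketched below. The object name `SubtractPeople` and the application name are made up for illustration, and the master URL is assumed to be supplied by spark-submit. Since `Person` is compiled as an ordinary top-level class here rather than defined in the REPL, `subtract` should compare records by their field values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Defined at the top level of a compiled source file, not inside the REPL.
case class Person(name: String, age: Int, city: String)

// Hypothetical entry point; the object name is an assumption for this sketch.
object SubtractPeople {
  def main(args: Array[String]): Unit = {
    // No master is set here: it is assumed to come from spark-submit (--master).
    val conf = new SparkConf().setAppName("SubtractPeople")
    val sc = new SparkContext(conf)
    try {
      val rdd2 = sc.parallelize(Seq(
        Person("Mary", 28, "New York"),
        Person("Bill", 17, "Philadelphia"),
        Person("Craig", 34, "Philadelphia"),
        Person("Leah", 26, "Rochester")))
      val rdd3 = sc.parallelize(Seq(
        Person("Mary", 28, "New York"),
        Person("Bill", 17, "Philadelphia"),
        Person("Craig", 35, "Philadelphia"),
        Person("Leah", 26, "Rochester")))

      // subtract keeps the elements of rdd2 that have no equal element in rdd3,
      // so it depends on Person's equals and hashCode.
      rdd2.subtract(rdd3).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Only the two Craig records differ (age 34 versus 35), so that is the one record the subtraction is expected to keep.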