Question

Here is my code example:

 case class Person(name:String,tel:String){
        def equals(that:Person):Boolean = that.name == this.name && this.tel == that.tel}

 val persons = Array(Person("peter","139"),Person("peter","139"),Person("john","111"))
 sc.parallelize(persons).distinct.collect

It returns:

 res34: Array[Person] = Array(Person(john,111), Person(peter,139), Person(peter,139))

Why doesn't distinct work? I want the result to be Person("john",111), Person("peter",139).

Answer 1

Extending further from @aaronman's observation, there is a workaround for this issue. On the RDD, there are two definitions of distinct:

 /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = distinct(partitions.size)

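It is apparent from the signature of the first distinct that there must be an implicit ordering of the elements, and that it is assumed to be null if absent, which is what the short version .distinct() does.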
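To make the map / reduceByKey / map pipeline in the definition above more concrete, here is a minimal local sketch in plain Scala, without Spark. The names DistinctSketch and distinctByKey are hypothetical, and groupBy is only a stand-in for reduceByKey, not how Spark implements it. The point it illustrates is that merging pairs by key depends on the key's equals(Any) and hashCode.

```scala
object DistinctSketch {
  // Hypothetical local analogue of the RDD pipeline:
  // pair each element with a dummy value, merge equal keys, keep the keys.
  def distinctByKey[T](xs: Seq[T]): Seq[T] =
    xs.map(x => (x, null))            // same shape as map(x => (x, null))
      .groupBy(_._1)                  // stand-in for reduceByKey; uses equals(Any) and hashCode
      .map { case (key, _) => key }   // same idea as map(_._1)
      .toSeq

  // A plain case class: equals and hashCode are generated by the compiler.
  case class Person(name: String, tel: String)

  val people = Seq(Person("peter", "139"), Person("peter", "139"), Person("john", "111"))
  val unique = distinctByKey(people)
}
```

With the compiler-generated equality of a case class, the two identical peter entries collapse into one. The order of the result is not guaranteed.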

Case classes have no default implicit ordering, but it is easy to implement one:

case class Person(name:String,tel:String) extends Ordered[Person] {
  def compare(that: Person): Int = this.name compare that.name
}

Now, trying the same example delivers the expected results (note that I am comparing names):

val ps5 = Array(Person("peter","138"),Person("peter","55"),Person("john","138"))
sc.parallelize(ps5).distinct.collect

res: Array[P5] = Array(P5(john,111), P5(peter,139))

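Note that case classes already implement equals and hashCode, so the implementation in the provided example is unnecessary and also incorrect. The correct signature of equals is: equals(arg0: Any): Boolean. By the way, I first thought the issue had to do with the incorrect equals signature, which sent me down the wrong path.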
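The difference between the two equals signatures can be seen in a small sketch on plain (non-case) classes. The class names Loose and Strict are hypothetical and exist only for this illustration. Loose declares equals(that: Loose), which overloads rather than overrides equals(Any), so == and collections still fall back to reference equality.

```scala
object EqualsSignature {
  // Overload: the parameter type is Loose, so equals(Any) is NOT overridden.
  class Loose(val name: String) {
    def equals(that: Loose): Boolean = this.name == that.name
  }

  // Override of the real method, with hashCode kept consistent with it.
  class Strict(val name: String) {
    override def equals(other: Any): Boolean = other match {
      case s: Strict => this.name == s.name
      case _         => false
    }
    override def hashCode(): Int = name.hashCode
  }

  // Sets compare elements through equals(Any), never through the overload.
  val looseCount  = Set(new Loose("peter"), new Loose("peter")).size
  val strictCount = Set(new Strict("peter"), new Strict("peter")).size
}
```

Here looseCount is 2 and strictCount is 1: only the override participates in deduplication.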

Answer 2

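For me the problem was related to object equality, as described by Martin Odersky in Programming in Scala (chapter 30), although I have a normal class (not a case class). For a correct equality test, you must redefine (override) hashCode() if you have a custom equals(). You also need a canEqual() method for 100% correctness. I have not looked at the implementation details of an RDD, but since it is a collection, it presumably uses some sophisticated/parallel variation of HashSet or another hash-based data structure to compare objects and ensure distinctness.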

Declaring the hashCode(), equals(), canEqual() and compare() methods solved my problem:

override def hashCode(): Int = {
  41 * (41 + name.hashCode) + tel.hashCode
}

override def equals(other: Any) = other match {
  case other: Person =>
    (other canEqual this) &&
    (this.name == other.name) && (this.tel == other.tel)
  case _ =>
    false
}

def canEqual(other: Any) = other.isInstanceOf[Person]

def compare(that: Person): Int = {
  this.name compare that.name
}
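For reference, the methods above can be assembled into one normal class and checked locally. This is a sketch under assumptions: the wrapper object PersonEquality is hypothetical, the class is assumed to extend Ordered[Person] so that compare has an interface to implement, and a local Seq.distinct is used in place of an RDD.

```scala
object PersonEquality {
  // A normal (non-case) class carrying the four methods from the answer.
  class Person(val name: String, val tel: String) extends Ordered[Person] {
    override def hashCode(): Int =
      41 * (41 + name.hashCode) + tel.hashCode

    override def equals(other: Any): Boolean = other match {
      case that: Person =>
        (that canEqual this) && this.name == that.name && this.tel == that.tel
      case _ => false
    }

    def canEqual(other: Any): Boolean = other.isInstanceOf[Person]

    // Ordering by name only, as in the answer.
    def compare(that: Person): Int = this.name compare that.name
  }

  val people = Seq(
    new Person("peter", "139"),
    new Person("peter", "139"),
    new Person("john", "111"))

  // Local deduplication; relies on equals(Any) and hashCode.
  val unique = people.distinct
}
```

With equals(Any) and hashCode overridden consistently, the duplicate peter entry is dropped and two elements remain.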

Answer 3

As others have pointed out, this is a bug in Spark 1.0.0. My theory on where it comes from is that if you look at the diff from 0.9.0 to 1.0.0, you see

-  def repartition(numPartitions: Int): RDD[T] = {
+  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = {

If you run

case class A(i:Int)
implicitly[Ordering[A]]

you get the error

<console>:13: error: No implicit Ordering defined for A.
              implicitly[Ordering[A]]

So I think the workaround is to define an implicit ordering for the case class. Unfortunately I am not a Scala expert, but this answer seems to do it correctly.
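One possible way to define such an implicit ordering, shown as a sketch rather than as the fix the linked answer uses: put an Ordering[A] in the companion object of the case class so that implicit lookup finds it. The names OrderingSketch and byValue are hypothetical. In the REPL, the class and its companion need to be entered together (for example with :paste).

```scala
object OrderingSketch {
  case class A(i: Int)

  object A {
    // Found automatically because the companion is part of A's implicit scope.
    implicit val byValue: Ordering[A] = Ordering.by[A, Int](_.i)
  }

  // This line no longer fails with "No implicit Ordering defined for A".
  val found: Ordering[A] = implicitly[Ordering[A]]

  val sorted = Seq(A(3), A(1), A(2)).sorted
}
```

Once the ordering is in scope, implicitly[Ordering[A]] compiles and sorted yields the elements in ascending order of i.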

Apache Spark: distinct doesn't work?
