Question

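I am trying to build the Spark MLlib module locally after editing the code in one of its classes.

I have read this solution: How to build Spark Mllib submodule individually. However, when I build the module with Maven, the resulting .jar is identical to the version in the repository, and the class does not contain my code.

I modified the BisectingKMeans.scala class because one of the fixes made in a pull request on GitHub (https://github.com/apache/spark) is not in the latest release.

The versions I am trying to build: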

mllib 2.11
spark: 2.1.0

In the BisectingKMeans.scala class, I need to change this:

  /**
   * Updates assignments.
   * @param assignments current assignments
   * @param divisibleIndices divisible cluster indices
   * @param newClusterCenters new cluster centers
   * @return new assignments
   */
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        val selected = children.minBy { child =>
          KMeans.fastSquaredDistance(newClusterCenters(child), v)
        }
        (selected, v)
      } else {
        (index, v)
      }
    }
  }

to this:

  /**
   * Updates assignments.
   * @param assignments current assignments
   * @param divisibleIndices divisible cluster indices
   * @param newClusterCenters new cluster centers
   * @return new assignments
   */
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        val newClusterChildren = children.filter(newClusterCenters.contains(_))
        if (newClusterChildren.nonEmpty) {
          val selected = newClusterChildren.minBy { child =>
            KMeans.fastSquaredDistance(newClusterCenters(child), v)
          }
          (selected, v)
        } else {
          (index, v)
        }
      } else {
        (index, v)
      }
    }
  }

and then build it. But I don't know how to do that.
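To make the difference between the two versions concrete, here is a minimal sketch without Spark. Everything in it is an assumption for illustration: the `AssignmentSketch` object, the use of plain `Array[Double]` instead of `VectorWithNorm`, the `squaredDistance` stand-in for `KMeans.fastSquaredDistance`, and the child-index layout (`2 * index` and `2 * index + 1`). Judging only from the diff, the patched version looks up a child's center only if that child is actually present in `newClusterCenters`; in Scala, `Map.apply` on a missing key throws `NoSuchElementException`.

```scala
object AssignmentSketch {
  // Assumption: children of node i sit at 2*i and 2*i + 1.
  def leftChildIndex(index: Long): Long = 2 * index
  def rightChildIndex(index: Long): Long = 2 * index + 1

  // Plain squared Euclidean distance; a stand-in for KMeans.fastSquaredDistance.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Original logic: looks up both children unconditionally.
  def assignOriginal(
      index: Long,
      v: Array[Double],
      divisible: Set[Long],
      centers: Map[Long, Array[Double]]): Long = {
    if (divisible.contains(index)) {
      val children = Seq(leftChildIndex(index), rightChildIndex(index))
      children.minBy(child => squaredDistance(centers(child), v))
    } else {
      index
    }
  }

  // Patched logic: only considers children that have a center.
  def assignPatched(
      index: Long,
      v: Array[Double],
      divisible: Set[Long],
      centers: Map[Long, Array[Double]]): Long = {
    if (divisible.contains(index)) {
      val children = Seq(leftChildIndex(index), rightChildIndex(index))
      val present = children.filter(centers.contains)
      if (present.nonEmpty) {
        present.minBy(child => squaredDistance(centers(child), v))
      } else {
        index
      }
    } else {
      index
    }
  }

  def main(args: Array[String]): Unit = {
    val divisible = Set(1L)
    val centers = Map(2L -> Array(0.0, 0.0)) // the center for child 3 is missing
    val point = Array(1.0, 1.0)

    // The patched version falls back to the only child that has a center.
    println(assignPatched(1L, point, divisible, centers))
    // The original version fails on the missing key.
    println(scala.util.Try(assignOriginal(1L, point, divisible, centers)).isFailure)
  }
}
```

Whether a missing child center is the exact situation the pull request targets is not stated in the question; the sketch only shows what the extra `filter` changes.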

Answer 1

I would create another class that extends the class you want to modify:

// Note: Scala does not allow a private method to be overridden, so treat this as a
// sketch of the intent. If the compiler rejects the override, copy the class into
// your own project as described below and edit the method there.
class MyBisectingKmeans extends BisectingKMeans {
  override private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        // Only consider children that actually have a new cluster center.
        val newClusterChildren = children.filter(newClusterCenters.contains(_))
        if (newClusterChildren.nonEmpty) {
          val selected = newClusterChildren.minBy { child =>
            KMeans.fastSquaredDistance(newClusterCenters(child), v)
          }
          (selected, v)
        } else {
          (index, v)
        }
      } else {
        (index, v)
      }
    }
  }
}

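Some of the classes used in this method are private (for example, VectorWithNorm). To access them from your local project, create a package with the path org.apache.spark.mllib.clustering and copy the class's code there, so that your new BisectingKMeans can access them.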
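The mechanism behind this is Scala's package-qualified `private[...]` modifier: code declared in the same package can see such members even if it lives in a different project. Below is a minimal sketch of that mechanism only. The package `com.example.lib.clustering`, the simplified `VectorWithNorm`, and `MyAlgorithm` are hypothetical stand-ins, not Spark's actual code, and the two package blocks represent two separate source files.

```scala
// File 1 (stand-in for the library): a class visible only inside its package.
package com.example.lib.clustering {
  private[clustering] class VectorWithNorm(val values: Array[Double]) {
    val norm: Double = math.sqrt(values.map(x => x * x).sum)
  }
}

// File 2 (your own project): declaring the same package name grants access.
package com.example.lib.clustering {
  object MyAlgorithm {
    def normOf(values: Array[Double]): Double = new VectorWithNorm(values).norm
  }
}

object PackageAccessDemo {
  def main(args: Array[String]): Unit = {
    // Code outside the package goes through the public object instead.
    println(com.example.lib.clustering.MyAlgorithm.normOf(Array(3.0, 4.0)))
  }
}
```

Moving `MyAlgorithm` to any other package would make the reference to `VectorWithNorm` a compile error, which is why the copied class has to sit under the library's own package path.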

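I have done this with other Spark MLlib classes, but I have not tried it with this particular one, so I am not sure whether there are more specific details to consider in this case. Any private methods or classes it needs would have to be duplicated locally, but a change this small should not require much code.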

This way you do not need to rebuild Spark MLlib.
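As a sketch of what that looks like in a build definition, assuming an sbt project (the question uses Maven, so sbt is swapped in here purely for illustration, and the project name is hypothetical): the project depends on the stock published spark-mllib artifact for the versions named in the question, and the copied class is compiled as part of your own sources.

```scala
// build.sbt (assumption: an sbt build; the question itself uses Maven)
name := "my-bisecting-kmeans" // hypothetical project name

// Spark 2.1.0 artifacts are published for Scala 2.11.
scalaVersion := "2.11.8"

// Unmodified spark-mllib from the public repository; %% appends the _2.11 suffix.
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.0"
```

With this layout, the modified class would live under `src/main/scala/org/apache/spark/mllib/clustering/` in your project, matching the package path mentioned above.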

Modifying and building spark-mllib locally
