Object not serializable: KMeans model in Spark MLlib

Date: 2018-02-21 19:27:21

Tags: scala apache-spark apache-spark-mllib

I am running the following code:

def calcClusteringScores(data: RDD[Vector], k: Int): Double = {
  val model = KMeans.train(data, k, maxIterations = 1)
  data.map(datum => distanceToCentroid(datum, model)).mean()
}

KMeans.train returns a KMeansModel (see: here), which implements Serializable, so it should be serializable.

However, when I run the data.map function I get an object not serializable exception complaining about the model. Is there a way around this that I have missed?
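One workaround that is often suggested for this kind of error is to avoid capturing the model in the closure at all, and instead close over only its cluster centers, which are a plain Array[Vector]. A minimal sketch (calcClusteringScoresNoModel and the manual nearest-centre search are illustrative, not code from my project):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    def calcClusteringScoresNoModel(data: RDD[Vector], k: Int): Double = {
      val model = KMeans.train(data, k, maxIterations = 1)
      // Extract the centers on the driver; Array[Vector] is small and serializable.
      val centers = model.clusterCenters
      data.map { datum =>
        // Distance to the nearest center; the task captures only `centers`,
        // not the model (and not the enclosing class).
        centers.map(c => math.sqrt(Vectors.sqdist(datum, c))).min
      }.mean()
    }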

Update 1

Below is the distanceToCentroid method, which calls distance; distance computes the Euclidean distance between two vectors:

def distanceToCentroid(datum: Vector, model: KMeansModel): Double = {
  val cluster = model.predict(datum)
  val clusterCenter = model.clusterCenters(cluster)
  distance(datum, clusterCenter)
}

def distance(a: Vector, b: Vector): Double = {
  val a_arr = a.toArray
  val b_arr = b.toArray
  val pairs = a_arr.zip(b_arr)
  val sumOfSquares = pairs.map(pair => pair._1 - pair._2)
                          .map(diff => diff * diff)
                          .sum
  sqrt(sumOfSquares)
}
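As an aside, MLlib already exposes a squared-distance helper, so distance can be collapsed to a one-liner (equivalent to the hand-rolled version above, assuming both vectors have the same dimension):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import scala.math.sqrt

    // Euclidean distance via MLlib's built-in squared distance.
    def distance(a: Vector, b: Vector): Double = sqrt(Vectors.sqdist(a, b))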

Update 2

I fixed the serialization problem by moving the method bodies out of the functions and into the main method. I no longer get the serialization error, but I do not know why. Does anyone have any ideas?

def testSerialiseModel(): Unit = {
    val sparkConf     = new SparkConf().setAppName("ModelTest").setMaster("local")
    val sc            = new SparkContext(sparkConf)
    val sparkSession  = SparkSession.builder().getOrCreate()

    val data = sc.parallelize(Array(
      Vectors.dense(Array(1.0, 2.0, 3.0)),
      Vectors.dense(Array(1.0, 1.8, 2.3)),
      Vectors.dense(Array(2.0, 1.5, 3.0))
    ))

    val model = KMeans.train(data, 2, maxIterations = 1)

    val score = data.map{datum =>
      val cluster = model.predict(datum)
      val clusterCenter = model.clusterCenters(cluster)

      val pairs = datum.toArray.zip(clusterCenter.toArray)
      val sumOfSquares = pairs.map(pair => pair._1 - pair._2)
                        .map(diff => diff * diff)
                        .sum
      sqrt(sumOfSquares)

    }.mean()

    println(s"clustering score: ${score}")
}

1 Answer:

Answer 0 (score: 0)

These functions lived inside a Scala class; changing the class to an object, as suggested in the post linked by @user322778, fixed the problem. The class had no instance variables, so the change to an object was trivial.
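For anyone hitting the same error, a minimal sketch of the before/after (ClusteringScorer and its methods are illustrative names, not code from the question):

    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    import scala.math.sqrt

    // Before: distanceToCentroid is an instance method, so the lambda in
    // score captures `this`, and Spark tries to serialize the whole
    // (non-serializable) class.
    class ClusteringScorerBroken {
      def distanceToCentroid(datum: Vector, model: KMeansModel): Double =
        sqrt(Vectors.sqdist(datum, model.clusterCenters(model.predict(datum))))

      def score(data: RDD[Vector], model: KMeansModel): Double =
        data.map(d => distanceToCentroid(d, model)).mean() // drags in `this`
    }

    // After: the same code on an object. Methods on a singleton are reached
    // statically rather than through a captured `this`, so only the
    // (serializable) model is shipped to the executors.
    object ClusteringScorer {
      def distanceToCentroid(datum: Vector, model: KMeansModel): Double =
        sqrt(Vectors.sqdist(datum, model.clusterCenters(model.predict(datum))))

      def score(data: RDD[Vector], model: KMeansModel): Double =
        data.map(d => distanceToCentroid(d, model)).mean()
    }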