I am running the following code:
def calcClusteringScores(data: RDD[Vector], k: Int): Double = {
  val model = KMeans.train(data = data, k, maxIterations = 1)
  data.map(datum => distanceToCentroid(datum, model)).mean()
}
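KMeans.train returns a KMeansModel (see: here), which implements Serializable and so should be serializable. However, when the data.map function runs, I get an "object not serializable" exception complaining about the model. Is there a way around this that I am missing?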
Below is the distanceToCentroid method, which calls distance. The distance method computes the Euclidean distance between two vectors.
def distanceToCentroid(datum: Vector, model: KMeansModel): Double = {
  val cluster = model.predict(datum)
  val clusterCenter = model.clusterCenters(cluster)
  distance(datum, clusterCenter)
}

def distance(a: Vector, b: Vector): Double = {
  val a_arr = a.toArray
  val b_arr = b.toArray
  val pairs = a_arr.zip(b_arr)
  val sumOfSquares = pairs.map(pair => pair._1 - pair._2)
    .map(diff => diff * diff)
    .sum
  sqrt(sumOfSquares)
}
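I fixed the serialization problem by moving the method bodies out of the functions and into the main method. I no longer get the serialization error, but I don't know why. Does anyone have any ideas?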
def testSerialiseModel() = {
  val sparkConf = new SparkConf().setAppName("ModelTest").setMaster("local")
  val sc = new SparkContext(sparkConf)
  val sparkSession = SparkSession.builder().getOrCreate()
  val data = sc.parallelize(Array(
    Vectors.dense(Array(1.0, 2.0, 3.0)),
    Vectors.dense(Array(1.0, 1.8, 2.3)),
    Vectors.dense(Array(2.0, 1.5, 3.0))
  ))
  val model = KMeans.train(data = data, 2, maxIterations = 1)
  val score = data.map { datum =>
    val cluster = model.predict(datum)
    val clusterCenter = model.clusterCenters(cluster)
    val pairs = datum.toArray.zip(clusterCenter.toArray)
    val sumOfSquares = pairs.map(pair => pair._1 - pair._2)
      .map(diff => diff * diff)
      .sum
    sqrt(sumOfSquares)
  }.mean()
  println(s"clustering score: ${score}")
}
Answer 0 (score: 0)
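The functions were located inside a Scala class. Changing the class to an object, as suggested in the post that @user322778 linked to, solved the problem. The class had no instance variables, so changing it to an object was trivial.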
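One likely explanation, offered as an assumption rather than something confirmed in this thread: a lambda that calls a method of its enclosing class captures `this`, so serializing the lambda means serializing the whole enclosing instance. If that class is not Serializable, serialization fails, whereas a lambda inside an object captures no instance. The sketch below reproduces the effect with plain Java serialization and no Spark. It assumes Scala 2.12 or later, where function literals are serializable, and the names `ScoresInClass`, `ScoresInObject`, and `SerializationCheck` are hypothetical stand-ins, not the asker's code.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the original enclosing class: it is NOT Serializable.
class ScoresInClass {
  def distance(a: Double, b: Double): Double = math.abs(a - b)
  // Calling the instance method `distance` makes the lambda capture `this`.
  def closure: Double => Double = x => distance(x, 1.0)
}

// The same code in an object: `distance` is reached through the singleton,
// so the lambda captures no enclosing instance.
object ScoresInObject {
  def distance(a: Double, b: Double): Double = math.abs(a - b)
  def closure: Double => Double = x => distance(x, 1.0)
}

object SerializationCheck {
  // Returns true if Java serialization of `value` succeeds.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(s"closure from class:  ${canSerialize(new ScoresInClass().closure)}")
    println(s"closure from object: ${canSerialize(ScoresInObject.closure)}")
  }
}
```

This would also be consistent with the question's workaround: once the method bodies are inlined into the closure, it no longer calls an instance method and only references the local `model` value.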