Calculating the distance between a vector and K-means cluster centers

Time: 2017-10-25 03:43:05

Tags: scala apache-spark spark-dataframe rdd apache-spark-mllib

I have a training dataset on which I ran K-means with K = 4 and obtained four cluster centers. For a new data point, I want to know not only the predicted cluster but also the distance to that cluster's center. Is there an API to compute the Euclidean distance from the center? I can make two API calls if necessary. I am using Scala and could not find any examples anywhere.

2 Answers:

Answer 0 (score: 3):

Since Spark 2.0, Vectors.sqdist can be used to compute the squared distance between two vectors.

You can use a UDF to compute each point's distance from its cluster center, as shown below:

// Assumes a SparkSession named `spark` (as in spark-shell); its implicits provide toDF and the $-column syntax
import spark.implicits._
import org.apache.spark.ml.linalg.{Vectors, Vector}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.udf

// Sample points
val points = Seq(Vectors.dense(1,0), Vectors.dense(2,-3), Vectors.dense(0.5, -1), Vectors.dense(1.5, -1.5))    
val df = points.map(Tuple1.apply).toDF("features")

// K-means
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
val kmeansModel = kmeans.fit(df)

val predictedDF = kmeansModel.transform(df)
// predictedDF.schema = (features: Vector, prediction: Int)

// Cluster Centers
kmeansModel.clusterCenters foreach println
/*
[1.75,-2.25]
[0.75,-0.5]
*/

// UDF that computes each point's distance from its assigned cluster center
val distFromCenter = udf((features: Vector, c: Int) => Vectors.sqdist(features, kmeansModel.clusterCenters(c)))

val distancesDF = predictedDF.withColumn("distanceFromCenter", distFromCenter($"features", $"prediction"))
distancesDF.show(false)
/*
+----------+----------+------------------+
|features  |prediction|distanceFromCenter|
+----------+----------+------------------+
|[1.0,0.0] |1         |0.3125            |
|[2.0,-3.0]|0         |0.625             |
|[0.5,-1.0]|1         |0.3125            |
|[1.5,-1.5]|0         |0.625             |
+----------+----------+------------------+
*/

Note: Vectors.sqdist computes the squared distance between two vectors (no square root). If you need the Euclidean distance, you can use Math.sqrt(Vectors.sqdist(...)).
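
For example, here is a minimal sketch of a Euclidean-distance variant of the UDF above, reusing the kmeansModel and predictedDF already defined (the column name "euclideanFromCenter" is just illustrative):

// UDF returning the Euclidean (square-rooted) distance from each point to its assigned center
val euclideanFromCenter = udf((features: Vector, c: Int) =>
  math.sqrt(Vectors.sqdist(features, kmeansModel.clusterCenters(c))))

val euclideanDF = predictedDF.withColumn("euclideanFromCenter", euclideanFromCenter($"features", $"prediction"))
euclideanDF.show(false)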

Answer 1 (score: 0):

The following worked for me...

def EuclideanDistance(xs: Array[Double], ys: Array[Double]): Double = {
  // Pair up the coordinates, square the componentwise differences, sum, and take the square root
  scala.math.sqrt((xs zip ys).map { case (x, y) => scala.math.pow(y - x, 2.0) }.sum)
}
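
As a hypothetical usage (assuming the kmeansModel from the first answer), the cluster-center Vectors can be converted to arrays with toArray:

// Euclidean distance from one point to each of the model's cluster centers
val point = Vectors.dense(1.0, 0.0)
kmeansModel.clusterCenters.foreach { center =>
  println(EuclideanDistance(point.toArray, center.toArray))
}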