Running KMeans clustering with Spark: the program blocks?

Time: 2014-03-31 05:00:17

Tags: scala k-means apache-spark

I am running KMeans clustering with the Apache Spark Scala API. My program is as follows:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit conversions needed for reduceByKey
import org.apache.spark.util.Vector

object KMeans {

  // Return the index of the center closest (in squared distance) to p.
  def closestPoint(p: Vector, centers: Array[Vector]): Int = {
    var bestIndex = 0
    var closest = Double.PositiveInfinity
    for (i <- 0 until centers.length) {
      val tempDist = p.squaredDist(centers(i))
      if (tempDist < closest) {
        closest = tempDist
        bestIndex = i
      }
    }
    bestIndex
  }

  // Parse one whitespace-separated line of numbers into a Vector.
  def parseVector(line: String): Vector = {
    new Vector(line.split("\\s+").map(_.toDouble))
  }

  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "F:/OpenSoft/hadoop-2.2.0")
    val sc = new SparkContext("local", "kmeans cluster",
      "G:/spark-0.9.0-incubating-bin-hadoop2",
      SparkContext.jarOfClass(this.getClass()))
    val lines = sc.textFile("G:/testData/synthetic_control.data.txt")   // RDD[String]
    val count = lines.count
    val data = lines.map(parseVector _)    // RDD[Vector]
    data.foreach(println)

    val K = 6
    val convergeDist = 0.1
    val kPoint = data.takeSample(withReplacement = false, K, 42)  // Array[Vector]
    kPoint.foreach(println)

    var tempDist = 1.0
    while (tempDist > convergeDist) {
      // Assign each point to its closest center, carrying (point, 1) so the
      // reduce step can accumulate both the vector sum and the point count.
      val closest = data.map(p => (closestPoint(p, kPoint), (p, 1)))
      val pointStat = closest.reduceByKey { case ((x1, y1), (x2, y2)) =>
        (x1 + x2, y1 + y2)
      }
      // New center = vector sum of assigned points / number of assigned points.
      val newKPoint = pointStat.map { pair =>
        (pair._1, pair._2._1 / pair._2._2)
      }.collectAsMap()

      tempDist = 0.0
      for (i <- 0 until K) {
        tempDist += kPoint(i).squaredDist(newKPoint(i))
      }
      for (newP <- newKPoint) {
        kPoint(newP._1) = newP._2
      }
      println("Finish iteration (delta=" + tempDist + ")")
    }

    println("Finish centers: ")
    kPoint.foreach(println)
    System.exit(0)
  }
}
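
For comparison only (not the code I am running): MLlib also ships a KMeans implementation. The following is a minimal sketch of the same job, assuming a Spark release whose RDD-based MLlib API accepts mllib.linalg.Vector; on 0.9.0 the imports and input type may differ.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans => MLKMeans}
import org.apache.spark.mllib.linalg.Vectors

object MLlibKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "mllib kmeans sketch")
    // Parse each whitespace-separated row into an MLlib dense vector.
    val data = sc.textFile("G:/testData/synthetic_control.data.txt")
      .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
      .cache()
    // Cluster into K = 6 groups with at most 20 iterations.
    val model = MLKMeans.train(data, 6, 20)
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}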

When I run my hand-written program above in local mode, the log output is as follows:

..................
14/03/31 11:29:15 INFO HadoopRDD: Input split: hdfs://hadoop-01:9000/data/synthetic_control.data:0+288374

After that line the program blocks and makes no further progress.
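
To narrow down which stage hangs, a stripped-down driver can run one action per step. This is only a diagnostic sketch reusing the same input path and Spark 0.9.0 setup as above, not a confirmed fix.

import org.apache.spark.SparkContext
import org.apache.spark.util.Vector

object StageCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "stage check")
    val lines = sc.textFile("G:/testData/synthetic_control.data.txt")
    println("line count: " + lines.count())          // does reading finish?
    val data = lines.map(l => new Vector(l.split("\\s+").map(_.toDouble)))
    println("vector count: " + data.count())         // does parsing finish?
    val sample = data.takeSample(withReplacement = false, 6, 42)
    println("sampled centers: " + sample.length)     // does sampling finish?
    sc.stop()
  }
}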

Can anyone help me?

0 answers:

No answers