StreamingKMeansExample not working properly

Date: 2018-08-18 16:33:05

Tags: scala apache-spark machine-learning k-means

I have a problem with StreamingKMeansExample. For every test data point I use, the resulting cluster index is always zero. The code is the original example:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object App {
  def main(args: Array[String]) {
    if (args.length != 5) {
      System.err.println(
        "Usage: StreamingKMeansExample " +
          "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }
    // $example on$
    val conf = new SparkConf().setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

    // Training stream: one dense vector per line, e.g. [36.72,67.44]
    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    // Test stream: labeled points, e.g. (2,[9.26,68.19])
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
    // $example off$
  }
}
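
A note on setRandomCenters(args(4).toInt, 0.0): in the MLlib implementation this initializes the k centers with standard-normal draws and gives each center a weight of 0.0. Here is a minimal standalone sketch (the object name InspectCenters is mine, assuming the Spark 2.2.0 MLlib API) that prints what those initial centers look like before any batch arrives:

import org.apache.spark.mllib.clustering.StreamingKMeans

object InspectCenters {
  def main(args: Array[String]): Unit = {
    // Same configuration as the app above: k = 10, 2 dimensions
    val model = new StreamingKMeans()
      .setK(10)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0) // standard-normal draws, initial weight 0.0
    // latestModel() exposes the current StreamingKMeansModel
    model.latestModel().clusterCenters.foreach(println)
  }
}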

The build.sbt looks like this:

name := "myApp"
version := "0.1"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0",
  "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
)
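
This should not affect the clustering result, but as an aside: spark-submit already puts spark-core, spark-streaming, and spark-mllib on the runtime classpath, so a more consistent build.sbt fragment would mark all three as "provided" at the same version (a sketch, not a required fix):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-mllib"     % "2.2.0" % "provided",
  "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
)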

So as input I pass one txt file for training and one txt file for testing. The test data looks like this:

(2,[9.26,68.19])
(1,[3.27,9.14])

The training data looks like this:

[36.72,67.44]
[92.20,11.81]
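
For reference, these two line formats are exactly what Vectors.parse and LabeledPoint.parse expect. A minimal standalone sketch (the object name ParseCheck is mine) that parses the sample lines above:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ParseCheck {
  def main(args: Array[String]): Unit = {
    // Training lines are bare dense vectors
    val v = Vectors.parse("[36.72,67.44]")
    println(v)           // [36.72,67.44]

    // Test lines are (label,[features]) pairs
    val lp = LabeledPoint.parse("(2,[9.26,68.19])")
    println(lp.label)    // 2.0
    println(lp.features) // [9.26,68.19]
  }
}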

Now the spark-submit command is this (so batchDuration = 5, numClusters = 10, and numDimensions = 2):

 $SPARK_HOME/bin/spark-submit --master local[2] --class "App" ./target/scala-2.11/myapp_2.11-0.1.jar "./train" "./test" 5 10 2

I think everything is fine, but it assigns every test point to cluster zero. Any ideas?
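
One debugging idea (my assumption, not a confirmed answer): the sample training vectors above have coordinates in the tens, while the randomly initialized centers start near the origin, so it may help to watch how the centers move as training batches arrive. A fragment that could be added to the app above, right after model.trainOn(trainingData):

// Hypothetical debugging addition: on every training batch, dump the
// current centers to see whether they move toward the data's range.
trainingData.foreachRDD { (_, time) =>
  println(s"centers at $time:")
  model.latestModel().clusterCenters.foreach(println)
}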

0 Answers:

No answers yet.