I have a problem with the StreamingKMeansExample: for every test point I feed in, the predicted cluster index is always zero. The code is the original example.
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object App {
  def main(args: Array[String]) {
    if (args.length != 5) {
      System.err.println(
        "Usage: StreamingKMeansExample " +
          "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }

    // $example on$
    val conf = new SparkConf().setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

    // Training lines are bare vectors; test lines are labeled points
    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)

    // Update the model on each training batch, then print (label, predicted cluster)
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
    // $example off$
  }
}
The sbt build file looks like this:
name := "myApp"
version := "0.1"
scalaVersion := "2.11.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0",
  "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
)
So as input I pass one txt file for training and one txt file for testing. The test data looks like this:
(2,[9.26,68.19])
(1,[3.27,9.14])
The training data looks like this:
[36.72,67.44]
[92.20,11.81]
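For reference, a minimal sketch of how MLlib parses these two line formats (Vectors.parse and LabeledPoint.parse are the same parsers used in the code above; the sample values are just the points shown here):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ParseCheck {
  def main(args: Array[String]): Unit = {
    // A training line is a bare vector: "[x,y]"
    val trainPoint = Vectors.parse("[36.72,67.44]")
    println(trainPoint)          // [36.72,67.44]

    // A test line is a labeled point: "(label,[x,y])"
    val testPoint = LabeledPoint.parse("(2,[9.26,68.19])")
    println(testPoint.label)     // 2.0
    println(testPoint.features)  // [9.26,68.19]
  }
}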
The spark-submit command is this:
$SPARK_HOME/bin/spark-submit --master local[2] --class "App" ./target/scala-2.11/myapp_2.11-0.1.jar "./train" "./test" 5 10 2
I think everything is fine, but it assigns every test point to cluster zero. Any ideas?
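For debugging, here is a sketch of one way to print the current cluster centers after each batch, to check whether training actually moves them away from the random initial centers (latestModel() and clusterCenters are part of the MLlib streaming k-means API; hooking foreachRDD on the training stream is just my assumption about where to put the printout):

// Add after model.trainOn(trainingData) in main:
trainingData.foreachRDD { (_, time) =>
  // latestModel() returns the current StreamingKMeansModel;
  // print each center so we can see whether the model updates between batches
  println(s"centers at $time:")
  model.latestModel().clusterCenters.zipWithIndex.foreach {
    case (center, idx) => println(s"  cluster $idx: $center")
  }
}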