While I am reading CSV data containing double values from a directory and applying a streaming K-means model to it as shown below,
// CSV file
40.729,-73.9422
40.7476,-73.9871
40.7424,-74.0044
40.751,-73.9869
40.7406,-73.9902
.....
// SBT dependencies:
name := "Application Name"
version := "0.1"
scalaVersion := "2.11.12"

val sparkVersion = "2.3.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" % "spark-streaming_2.11" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % "2.3.1"
)
// import statements
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext, rdd}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.{KMeans, StreamingKMeans}
import org.apache.spark.mllib.linalg.Vectors
// Reading CSV data
val trainingData = ssc.textFileStream("directory path")
  .map(x => x.toDouble)
  .map(x => Vectors.dense(x))

// applying Streaming kmeans model
val model = new StreamingKMeans()
  .setK(numClusters)
  .setDecayFactor(1.0)
  .setRandomCenters(numDimensions, 0.0)

model.trainOn(trainingData)
I am getting the following error:
18/07/24 11:20:04 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 1)
java.lang.NumberFormatException: For input string: "40.7473,-73.9857"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:285)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
    at ubu$$anonfun$1.apply(uberclass.scala:305)
    at ubu$$anonfun$1.apply(uberclass.scala:305)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "streaming-job-executor-0" java.lang.Error: java.lang.InterruptedException
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Can anyone help?
Answer 0 (score: 1)
The problem is a dimension mismatch: the vectors passed to the streaming K-means model must have the same number of dimensions as the numDimensions value given to setRandomCenters. In your code each whole line (e.g. "40.7473,-73.9857") is passed to toDouble, which is what throws the NumberFormatException; split each line on the comma so that every record becomes a 2-dimensional vector matching numDimensions, as sketched below.
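A minimal sketch of that fix, assuming each line holds exactly two comma-separated coordinates; the object name, master URL, batch interval, directory path, and the numClusters value are placeholders I chose for illustration, not values from the question:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

object StreamingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingKMeansSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // placeholder batch interval

    // Split each line such as "40.729,-73.9422" on the comma before parsing,
    // so every record becomes a 2-element dense vector instead of failing in toDouble.
    val trainingData = ssc.textFileStream("directory path") // placeholder path
      .map(line => Vectors.dense(line.split(",").map(_.trim.toDouble)))

    val numClusters = 4    // placeholder value
    val numDimensions = 2  // must equal the number of values per line

    val model = new StreamingKMeans()
      .setK(numClusters)
      .setDecayFactor(1.0)
      .setRandomCenters(numDimensions, 0.0)

    model.trainOn(trainingData)

    ssc.start()
    ssc.awaitTermination()
  }
}

The key change is line.split(",").map(_.trim.toDouble), which turns each record into the two coordinates the model expects, so the vector dimension and numDimensions agree.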