Streaming K-means in Spark Scala: java.lang.NumberFormatException for input string

Time: 2018-07-24 06:16:13

Tags: scala streaming spark-streaming k-means

I am reading CSV data containing double values from a directory and applying a streaming K-means model to it, as shown below.

// CSV file

40.729,-73.9422
40.7476,-73.9871
40.7424,-74.0044
40.751,-73.9869
40.7406,-73.9902
.....

// SBT dependencies:

name := "Application name"

version := "0.1"

scalaVersion := "2.11.12"
val sparkVersion = "2.3.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" % "spark-streaming_2.11" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % "2.3.1")

// import statements

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext, rdd}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.{KMeans, StreamingKMeans}
import org.apache.spark.mllib.linalg.Vectors

// Reading the CSV data

val trainingData = ssc.textFileStream ("directory path") 
                      .map(x=>x.toDouble)
                      .map(x=>Vectors.dense(x))
// applying Streaming kmeans model
val model = new StreamingKMeans()
  .setK(numClusters)
  .setDecayFactor(1.0)
  .setRandomCenters(numDimensions, 0.0)
model.trainOn(trainingData)

I am getting the following error:

18/07/24 11:20:04 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 1)
java.lang.NumberFormatException: For input string: "40.7473,-73.9857"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:285)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
    at ubu$$anonfun$1.apply(uberclass.scala:305)
    at ubu$$anonfun$1.apply(uberclass.scala:305)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "streaming-job-executor-0" java.lang.Error: java.lang.InterruptedException
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Can anyone help?

1 answer:

Answer 0: (score: 1)

There is a dimension problem. The dimension of each vector passed to the streaming K-means model must be the same as numDimensions.
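For illustration, the stack trace shows toDouble being called on a whole line ("40.7473,-73.9857"), which is not a single number. Below is a minimal sketch of one way to build vectors of the right dimension, assuming each CSV row holds exactly numDimensions comma-separated doubles (2 for the sample file; the question does not state the value it uses). The names CsvVectors, parseLine and trainingStream are hypothetical helpers introduced here, not part of Spark or of the original code.

```scala
import scala.util.Try
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper object, not part of the original question.
object CsvVectors {

  // Split one CSV row on commas and convert every field to a Double.
  // Returns None when the number of fields differs from numDimensions
  // or when any field is not a valid number, so such rows can be skipped.
  def parseLine(line: String, numDimensions: Int): Option[Vector] = {
    val fields = line.split(",", -1).map(_.trim)
    if (fields.length != numDimensions) None
    else Try(Vectors.dense(fields.map(_.toDouble))).toOption
  }

  // Build the training stream: one dense vector per well-formed row.
  def trainingStream(ssc: StreamingContext,
                     dir: String,
                     numDimensions: Int): DStream[Vector] =
    ssc.textFileStream(dir)
      .flatMap(line => parseLine(line, numDimensions).toList)
}
```

With this sketch, the question's pipeline would become something like `val trainingData = CsvVectors.trainingStream(ssc, "directory path", numDimensions)`, with the same numDimensions passed to `setRandomCenters`. Skipping malformed rows silently is a design choice for the example; logging or counting them may be preferable in practice.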