How to transform the synthetic control dataset into an RDD[Vector] for K-Means

Asked: 2016-06-07 10:32:56

Tags: scala apache-spark bigdata

I am trying to transform the "Synthetic Control Chart Time Series" dataset from the UCI Machine Learning Repository.

The dataset looks like this:

28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337  34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27   30.7326 29.5054 33.0292 25.04   28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747  31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717
24.8923 25.741  27.5532 32.8217 27.8789 31.5926 31.4861 35.5469 27.9516 31.6595 27.5415 31.1887 27.4867 31.391  27.811  24.488  27.5918 35.6273 35.4102 31.4167 30.7447 24.1311 35.1422 30.4719 31.9874 33.6615 25.5511 30.4686 33.6472 25.0701 34.0765 32.5981 28.3038 26.1471 26.9414 31.5203 33.1089 24.1491 28.5157 25.7906 35.9519 26.5301 24.8578 25.9562 32.8357 28.5322 26.3458 30.6213 28.9861 29.4047 32.5577 31.0205 26.6418 28.4331 33.6564 26.4244 28.4661 34.2484 32.1005 26.691
31.3987 30.6316 26.3983 24.2905 27.8613 28.5491 24.9717 32.4358 25.2239 27.3068 31.8387 27.2587 28.2572 26.5819 24.0455 35.0625 31.5717 32.5614 31.0308 34.1202 26.9337 31.4781 35.0173 32.3851 24.3323 30.2001 31.2452 26.6814 31.5137 28.8778 27.3086 24.246  26.9631 25.2919 31.6114 24.7131 27.4809 24.2075 26.8059 35.1253 32.6293 31.0561 26.3583 28.0861 31.4391 27.3057 29.6082 35.9725 34.1444 27.1717 33.6318 26.5966 25.5387 32.5434 25.5772 29.9897 31.351  33.9002 29.5446 29.343

The data is stored in an ASCII file with 600 rows and 60 columns, one chart per row. The numbers within a row are separated by spaces, and rows are separated by newlines. I need to parse every row of 60 numbers and store the result in an RDD[Vector], where each Vector holds exactly 60 values. The RDD[Vector] should look like this:

[28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337  34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27   30.7326 29.5054 33.0292 25.04   28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747  31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717]
[24.8923 25.741  27.5532 32.8217 27.8789 31.5926 31.4861 35.5469 27.9516 31.6595 27.5415 31.1887 27.4867 31.391  27.811  24.488  27.5918 35.6273 35.4102 31.4167 30.7447 24.1311 35.1422 30.4719 31.9874 33.6615 25.5511 30.4686 33.6472 25.0701 34.0765 32.5981 28.3038 26.1471 26.9414 31.5203 33.1089 24.1491 28.5157 25.7906 35.9519 26.5301 24.8578 25.9562 32.8357 28.5322 26.3458 30.6213 28.9861 29.4047 32.5577 31.0205 26.6418 28.4331 33.6564 26.4244 28.4661 34.2484 32.1005 26.691]

I am trying to transform the data, but I get an exception. This is the code:

import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/home/david/Desktop/synthetic.txt")
val parsedData = data.map(s => Vectors.dense(s.split("\n").map(_.toDouble))).cache()

The exception is thrown when I run the K-Means algorithm:

val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

This is the exception:

16/06/07 19:56:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.NumberFormatException: empty String
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1020)
    at java.lang.Double.parseDouble(Double.java:540)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at org.test.spark.RunKMeans$$anonfun$1$$anonfun$apply$1.apply(RunKMeans.scala:22)
    at org.test.spark.RunKMeans$$anonfun$1$$anonfun$apply$1.apply(RunKMeans.scala:22)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.test.spark.RunKMeans$$anonfun$1.apply(RunKMeans.scala:22)
    at org.test.spark.RunKMeans$$anonfun$1.apply(RunKMeans.scala:22)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:283)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

How can I fix this? Thank you very much.

1 Answer:

Answer 0 (score: 0)

You are splitting on \n (newline), but you should split on spaces instead, and convert each element to a Double before you can pack them into a dense MLlib Vector.

So change s.split("\n") to s.split(" "), or use \s instead.
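
Applied to the code in the question, that first suggestion would read as follows (but see the EDIT below: the data contains runs of multiple spaces, and splitting on a single space still yields empty tokens there, so the same NumberFormatException remains):

val parsedData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble))).cache()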

========= EDIT =========

Since you need to split on runs of multiple spaces, you should use:

 split("\\s+")

which splits on both single and multiple spaces.
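
For completeness, here is a minimal end-to-end sketch (assuming Spark 1.x MLlib, as in the question, and the same file path). The added .trim is a defensive assumption: a line with leading or trailing whitespace would otherwise produce an empty token and the same NumberFormatException:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/home/david/Desktop/synthetic.txt")

// Trim each line, then split on runs of whitespace, so neither surrounding
// spaces nor multiple separators produce empty tokens.
val parsedData = data
  .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
  .cache()

val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)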