NullPointerException while saving KMeansModel

Time: 2015-07-24 21:04:21

Tags: scala apache-spark

I am new to Scala and to implementing ML algorithms. I am trying to fit a KMeansModel to my dataset. The code is as follows:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("path/to/my/dataset")   // placeholder path
val pdata = data.map(line => line.replaceAll("\"", " "))
// drop the first five fields and keep the remaining two (latitude, longitude)
val parsedData = pdata.map(s => Vectors.dense(s.split(",").drop(5).take(5).map(_.toDouble))).cache()

// Cluster the data into three classes using KMeans
val numClusters = 3
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save and load model
clusters.save(sc,"myModelPath")
val sameModel = KMeansModel.load(sc, "myModelPath")

A sample of the dataset is:

"1","AAH03JABiAAJKnPAa5","20120707","09:34:19","109","23.813900","90.398598"
"2","AAH03JAC4AAAcwTAQt","20120707","09:42:31","92","23.704201","90.429703"
"3","AAH03JAC4AAAcwhAVd","20120707","09:01:39","16","23.698900","90.435303"
"4","AAH03JAC4AAAcwhAVd","20120707","09:03:06","154","23.698900","90.435303"
"5","AAH03JAC7AAAcOtAFE","20120707","09:15:05","40","23.717501","90.471100"

The last two columns are the latitude and longitude, and I am trying to build the clusters from these two columns.
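
An equivalent way to select just those two columns is sketched below (my own variant, not part of the original code; `latLon` is a hypothetical name, and the `Vectors` import from the code above is assumed):

// Sketch: strip the quotes and keep only the last two fields (latitude, longitude)
val latLon = data.map { line =>
  val fields = line.replaceAll("\"", "").split(",")
  Vectors.dense(fields.takeRight(2).map(_.toDouble))
}.cache()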

I can get the clusters in the `clusters` variable, and I can also print the cluster centers and the SSE. But when I execute `clusters.save`, I get a NullPointerException.
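
For reference, inspecting the fitted model works along these lines (a sketch against the standard MLlib 1.4 KMeansModel API; these are the calls that succeed for me):

// Sketch: these calls work before the failing save below
clusters.clusterCenters.foreach(println)   // one center Vector per cluster
println("Within Set Sum of Squared Errors = " + clusters.computeCost(parsedData))

The save call below is the one that fails: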

scala> clusters.save(sc,"myModelPath")
15/07/24 15:53:08 INFO SparkContext: Starting job: saveAsTextFile at KMeansModel.scala:109
15/07/24 15:53:08 INFO DAGScheduler: Got job 55 (saveAsTextFile at KMeansModel.scala:109) with 1 output partitions (allowLocal=false)
15/07/24 15:53:08 INFO DAGScheduler: Final stage: ResultStage 70(saveAsTextFile at KMeansModel.scala:109)
15/07/24 15:53:08 INFO DAGScheduler: Parents of final stage: List()
15/07/24 15:53:08 INFO DAGScheduler: Missing parents: List()
15/07/24 15:53:08 INFO DAGScheduler: Submitting ResultStage 70 (MapPartitionsRDD[123] at saveAsTextFile at KMeansModel.scala:109), which has no missing parents
15/07/24 15:53:08 INFO MemoryStore: ensureFreeSpace(126776) called with curMem=13470621, maxMem=278019440
15/07/24 15:53:08 INFO MemoryStore: Block broadcast_102 stored as values in memory (estimated size 123.8 KB, free 252.2 MB)
15/07/24 15:53:08 INFO MemoryStore: ensureFreeSpace(42308) called with curMem=13597397, maxMem=278019440
15/07/24 15:53:08 INFO MemoryStore: Block broadcast_102_piece0 stored as bytes in memory (estimated size 41.3 KB, free 252.1 MB)
15/07/24 15:53:08 INFO BlockManagerInfo: Added broadcast_102_piece0 in memory on localhost:52074 (size: 41.3 KB, free: 252.6 MB)
15/07/24 15:53:08 INFO SparkContext: Created broadcast 102 from broadcast at DAGScheduler.scala:874
15/07/24 15:53:08 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 70 (MapPartitionsRDD[123] at saveAsTextFile at KMeansModel.scala:109)
15/07/24 15:53:08 INFO TaskSchedulerImpl: Adding task set 70.0 with 1 tasks
15/07/24 15:53:08 INFO TaskSetManager: Starting task 0.0 in stage 70.0 (TID 140, localhost, PROCESS_LOCAL, 1453 bytes)
15/07/24 15:53:08 INFO Executor: Running task 0.0 in stage 70.0 (TID 140)
15/07/24 15:53:08 ERROR Executor: Exception in task 0.0 in stage 70.0 (TID 140)
java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
15/07/24 15:53:08 WARN TaskSetManager: Lost task 0.0 in stage 70.0 (TID 140, localhost): java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I don't understand what is wrong here. I am using spark-1.4.1-bin-hadoop2.6.

  1. Can someone tell me why I am getting this exception?
  2. Also, is there a way to save the data of the 3 clusters into 3 different datasets for further use, e.g. to query each cluster separately? (a sketch of one possible approach follows this list)
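
For question 2, one possible approach is sketched below (not from the original post): label each point with its predicted cluster and save each group separately. The output path "clusters-output" is a placeholder.

// Sketch: split the points by predicted cluster id and save each group to its own directory
val byCluster = parsedData.map(v => (clusters.predict(v), v))
(0 until numClusters).foreach { k =>
  byCluster.filter { case (cid, _) => cid == k }
           .map { case (_, v) => v }
           .saveAsTextFile(s"clusters-output/cluster-$k")
}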

0 Answers:

No answers yet.