Error saving a random forest model in Scala on a Spark cluster

Time: 2016-04-03 18:16:55

Tags: scala apache-spark

I get the following error when saving a random forest model to disk. Spark cluster configuration: package - spark-1.6.0-bin-hadoop2.6, mode - standalone.

I run Spark with the same data copied to each slave node.

Command: localModel.save(SlapSparkContext.get(), path). The model has been trained and predicts correctly on test data.
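For context, here is a minimal sketch of how such a model is typically trained and saved with the Spark 1.6 MLlib RDD-based API. The input path, hyperparameters, and output path are illustrative assumptions, and the question's SlapSparkContext wrapper is replaced with a plain SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.mllib.util.MLUtils

    object SaveModelExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rf-save-example"))

        // Load a LIBSVM-format training set; the path is a placeholder.
        val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")

        // Train the ensemble (hyperparameters are illustrative only).
        val model: RandomForestModel = RandomForest.trainClassifier(
          data,
          numClasses = 2,
          categoricalFeaturesInfo = Map[Int, Int](),
          numTrees = 10,
          featureSubsetStrategy = "auto",
          impurity = "gini",
          maxDepth = 5,
          maxBins = 32)

        // save() writes the model as Parquet under `path`; every executor must be
        // able to reach that location, which in standalone mode usually means a
        // shared filesystem such as HDFS rather than per-node local copies.
        val path = "hdfs:///models/rf-model"
        model.save(sc, path)

        // Reload later with:
        // val reloaded = RandomForestModel.load(sc, path)
      }
    }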

Error trace:


java.lang.NullPointerException
    at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329)
    at org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.save(treeEnsembleModels.scala:453)
    at org.apache.spark.mllib.tree.model.RandomForestModel.save(treeEnsembleModels.scala:65)

1 Answer:

Answer 0 (score: 0):

This error occurs when you try to save an empty DataFrame. Check whether a step before this line of code is filtering out or reducing your records so that none are left. A minimal way to verify this is shown in the sketch below.
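A hedged sketch of such a check, placed just before training/saving (filteredData is a hypothetical name standing in for whatever RDD or DataFrame the upstream filtering steps produce in the asker's pipeline):

    // Count what survived the upstream filtering before attempting to save.
    val n = filteredData.count()
    require(n > 0, s"No records left after filtering (count = $n); saving the model would fail")

    println(s"Training on $n records")
    // ...train the model and call model.save(sc, path) only when data is present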